Abstract: Cognitive (Radio) (CR) Communications (CC) are
mainly deployed within the environments of primary (user)
communications, where the channel states and accessibility are
usually stochastically distributed (benign or IID). However, many
practical CC are also exposed to disturbing events (contaminated)
and vulnerable to jamming attacks (adversarial or non-IID). Thus,
the channel state distribution of the spectrum could be stochastic,
contaminated, or adversarial at different temporal and spatial
locations. Without any prior knowledge, facilitating optimal CC
is very challenging. In this paper, we propose an online learning
algorithm that performs joint channel sensing, probing,
and adaptive channel access for multi-channel CC in general
unknown environments. We pay special attention to energy-efficient
CC (EECC), which is highly desirable for green wireless
communications and is needed to combat potential
jamming attacks that could greatly mar the energy and spectrum
efficiency of CC. The EECC is formulated as a constrained regret
minimization problem with power budget constraints. By tuning
a novel exploration parameter, our algorithms adaptively
find the optimal channel access strategies and achieve almost
optimal learning performance for EECC in different scenarios,
with vanishing long-term power budget violations.
We also consider the important scenario of cooperative learning
and information sharing among multiple CR users, which yields
further performance improvements. The proposed algorithms are
resilient to both oblivious and adaptive jamming attacks with
different intelligence and attacking strength. Extensive numerical
results are presented to validate our theory.

Index Terms —Energy Efficiency, Cognitive Radio, Online 
learning, Jamming attack and Multi-armed bandits 

I. Introduction 

Cognitive (Radio) (CR) Communications (CC) are widely
recognized as one of the promising Information and
Communication Technologies (ICT) for relieving the current
spectrum-scarcity problem. Meanwhile, with its explosive growth,
ICT is playing an increasingly important role in global
greenhouse gas emissions: its energy consumption
accounts for 3 percent of worldwide electric energy
consumption nowadays m. Thus, Energy-Efficient (EE) CC
(EECC) has received great attention from the research
community in recent years 0. Admittedly, the joint design of a
channel sensing, probing, and accessing (SPA) scheme with
consideration of energy efficiency (EE) is pivotal for CC.
Stimulated by the recent appearance of smart CR devices
with adaptive and learning abilities, modern CC places
very high requirements on its solutions, especially in complex
environments, where accurate channel distributions and states
can barely be modeled and acquired due to unpredictable
Primary User Activity (PUA) [3] in Primary Communications
(PC), behaviors of other CC, potential jamming attacks, and
other disturbing events frequently seen in practice. Thus,
it is critical for CR devices to learn from the environment
and to strike a good balance between allocating transmission power
wisely, to achieve energy efficiency, and designing
almost optimal channel access schemes, to achieve
spectrum efficiency in EECC.

Undoubtedly, the communication model has a great impact
on the performance of CC. A large body of work assumes
statistical information known a priori and proposes
deterministic channel state models, e.g., POMDP q, and
accessibility models, e.g., Poisson modeling of PUA [8], which make good
approximations in benign wireless environments. Clearly, they
are not suitable for complex or even unknown environments.
To cope with this problem, a fairly reasonable and realistic line
of studies assumes no statistical prior information about the
channel states and accessibility. Thus, online learning based
methods (e.g., reinforcement learning (RL) Eol) are desirable,
e.g., [13], [21], [22], [25]. Within this context,
the Multi-armed bandit (MAB) theory [20] is widely
favored over other learning approaches.

In summary, these works assume that the nature of CC
environments is either stochastic (benign), where the channel
state and accessibility are stochastically distributed [26] (IID),
or adversarial ED E2, where they can vary arbitrarily
(adversarial or non-IID) due to jamming attackers or adversarial
PUA, etc. Accordingly, these works fall mainly
into two MAB models, namely, stochastic MAB mm
with the IID assumption and adversarial MAB ED E3 with
the non-IID assumption. The analytical approaches
and results for the two models are distinctly different. Note
that the learning performance is quantified by the classic term
“regret”, i.e., the performance difference between the proposed
learning algorithm and the optimal one known in hindsight.
A well-known fact is that stochastic MAB and adversarial
MAB have the respective optimal regrets $O(\log n)$ [18] and
$O(\sqrt{n})$ [19] over time $n$. Clearly, learning in the stochastic
MAB converges to the optimal strategies much faster than in
the adversarial MAB.
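
To make the difference between the two optimal rates concrete, the following minimal sketch tabulates $\log n$ and $\sqrt{n}$ for a few horizons. It ignores the constants hidden by the $O(\cdot)$ notation, and the function name `regret_orders` is only illustrative; it does not appear in this paper.

```python
import math


def regret_orders(n):
    """Return (log n, sqrt n): the orders of the optimal regret in the
    stochastic and adversarial MAB, ignoring constants hidden in O(.)."""
    return math.log(n), math.sqrt(n)


for horizon in (10**2, 10**4, 10**6):
    stochastic, adversarial = regret_orders(horizon)
    print(f"n={horizon:>8}  log(n)={stochastic:7.2f}  sqrt(n)={adversarial:8.2f}")
```

Under these assumptions the logarithmic term stays in the tens while the square-root term reaches the thousands at $n = 10^6$, which is why treating a benign channel as adversarial can be costly.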

However, all related works FlQll- lfl2t ED-EHl still rely on
prior knowledge of either the stochastic or the adversarial assumption,
which is limited in describing practical CC environments,
because the nature of practical CC environments is
not restricted to these two types and usually cannot be
known in advance. On the one hand, consider a CC under
potential jamming attack. Since the number and locations of
jamming regions are often unrevealed, it is uncertain which
regions may (or may not) suffer from the attack. Thus, the
usual approach of applying adversarial MAB models fTH [22] to
all channels will lead to large regret, since a great
portion of channels can still be stochastically distributed, while
applying stochastic MAB models is not feasible due to
the existence of adversaries.

On the other hand, the stochastic MAB model ED d,
E9, IM ED will face practical implementation issues.
In almost all CC systems, occasional
disturbing events make the stochastic channel
distributions contaminated. These include the burst movements of
individuals, spectrum handoff and mobility 0| among
users of PC and CC, the jitter effects of electromagnetic
waves, etc. In this case, the channel distribution will not follow
an IID process for a small portion of time. Thus, it is not
clear whether the stochastic MAB is still applicable,
how the contamination affects the learning performance, and
to what extent the contamination is negligible. Therefore, the
design of a unified SPA scheme without any prior knowledge
of the operating environment is very challenging, yet it is highly
desirable and bears great theoretical value.

In this paper, we propose a novel adaptive multi-channel
SPA algorithm for EECC that achieves almost optimal learning
performance without any prior knowledge of the CC environments.
Importantly, we take EE into special consideration by imposing
power budget constraints on each multi-channel access
strategy. As such, our work can be regarded as the first work
on EECC in unknown environments, where optimal
strategies can be gradually learned. Our SPA scheme
is based on the well-known EXP3 [33] algorithm for the
non-stochastic MAB, with three main features: 1) we introduce a
new control parameter into the exploration probability for each
channel to facilitate automatic detection of the features of the
environment; 2) we carefully design a Lagrangian method
to model the power budget constraints of our
EECC problem; 3) by jointly controlling the learning rate
and exploration probability, the proposed algorithm achieves
almost optimal learning performance in different regimes with
vanishing (sublinear) long-term power budget violations. Our
main contributions are summarized as follows.

1) We define an appropriate EE model that is suitable
for SPA-scheme-based EECC over large spectrum pools and
with fairness considerations. We categorize the features of the
EECC environments mainly into four typical regimes, in each
of which the algorithm is proved to achieve almost optimal regret
bounds with sublinear long-term power budget violations.
Our proposed algorithm needs neither to distinguish the type
of PC, other CC, and adversarial (jamming) behaviors, nor
to know the channel accessibility and quality within
all the different features of the complex environments. Thus,
it provides a complete solution for practical CC in general
unknown environments.

2) The proposed AOEECC-EXP3++ algorithm considers
information sharing of channels that belong to different channel
access strategies, which can be regarded as a special type
of combinatorial semi-bandit¹ problem. In this case, given
the total number of channels $K$ and the number of transmitting
channels $k$, the AOEECC-EXP3++ algorithm has optimal
tight regret bounds in both the adversarial setting [38] and the
stochastic setting ED, which indicates good scalability
for CC systems of different sizes.

¹The term first appears in ED, which is the combinatorial version of the
classic MAB problems.

3) This work is also the first MAB-based constrained regret
minimization (optimization) framework for CC in unknown
environments in the online learning setting. Our proposed
algorithms have polynomial-time implementations, which results
in good computational efficiency in practice.

4) We propose a novel cooperative learning algorithm that
considers information sharing among multiple users of the CC
system to accelerate the learning speed of each individual
user, which fits the widely acknowledged feature
of CC systems with cooperative spectrum sensing and sharing
schemes 0. It further improves the energy efficiency and
spectrum efficiency of the EECC within a fixed time period.

5) We conduct diverse experiments based on
real experimental datasets. Numerical results demonstrate
the advantages of the proposed algorithms and show that they can be
implemented easily in practice.

The rest of this paper is organized as follows: Section II
discusses related works. Section III describes the problem
formulation. Section IV introduces the distributed learning
algorithm, i.e., AOEECC. The performance results are presented
in Section V. The multi-user cooperative learning algorithm
is discussed in Section VI. Proofs for the previous sections are in
Section VII and Section VIII. Simulation results are available
in Section IX. The paper is concluded in Section X.

II. Related Works 

Recently, online learning approaches to the dynamic
channel access (DSA) problem in CC with less prior channel
statistical information have received more attention
than classic deterministic model approaches, e.g., channel
state 0 and accessibility modeling 0. The characteristics
of repeated interactions with environments are usually
categorized into the domain of RL [40], e.g., DSA by RL [13],
anti-jamming CC by RL flAl - HTl. It is worth pointing out
that there exists extensive literature on RL, which is generally
targeted at a broader set of learning problems in Markov
Decision Processes (MDPs). RL approaches guarantee
performance only asymptotically, as time goes to infinity. Hence,
they are not well suited to mission-critical advanced applications of
CC, which are commonly seen in next-generation wireless
communications. By contrast, MAB problems constitute a
special class of MDPs, for which the no-regret learning
framework is generally viewed as more effective in terms of fast
convergence time, finite-time optimality guarantee [39], and
low computational complexity. Moreover, it has the inherent
capability of keeping a good balance between “exploitation”
and “exploration”. Thus, the use of MAB models is widely
favored.

The works based on the stochastic MAB model often
consider stochastically distributed channels in benign
environments, such as [23], [24], [26]-[28], [36]. The adversarial
MAB model is applied to adversarial channel conditions, such
as anti-jamming CC ED El. In the machine learning
community, the stochastic and adversarial MABs have co-existed
in parallel for almost two decades. Only recently was the first
practical algorithm for both stochastic and adversarial bandits
proposed in [35] for the classic MAB problem. The current
work uses the idea of introducing a novel exploration
parameter from [35]. However, our focus is on the much harder
combinatorial semi-bandit problem, which requires exploiting the
channel dependency among different SPA strategies, a nontrivial
task. Moreover, introducing the Lagrangian method into
the online EECC problem leads to an important finding:
the learning rate and exploration probability must be set
jointly, and identically for all regimes (as we define them), rather
than adjusted separately for the stochastic and adversarial
regimes as in [35]. This phenomenon indicates that online
learning for EECC in unknown environments is a harder
problem than classic regret minimization without constraints
ED El £3).

The topic of EECC has recently received great attention in
the wireless communications community 0, stimulated by
green communications for ICT. Spectrum efficiency
and energy efficiency are the two critical concerns. Almost
all of these works consider deterministic channel state and
accessibility models [42]-[47] for DSA in CC. Some
works aim at spectrum efficiency [43] or energy
efficiency [42], while others aim at both goals [44] -
s 3. It is worth mentioning that a small number of
works focus only on optimizing the EE of the spectrum
sensing part [48] [50] within the whole CC cycle. This part
of the energy cost is comparatively minor when compared
to the circuit and transmission energy cost 0, and it can be
categorized into the circuit and processing energy cost as in
classic wireless communications 0.

Recent works (T¥l IT9) have used the exponential weights
(similar to EXP3) MAB model to study no-regret (sublinear)
online learning for the EE of OFDM and MIMO-OFDM
wireless communications. However, the problems are different
from EECC, and the dynamic channel evolution process
is only assumed to be adversarial. Thus, our work is the
first SPA scheme for EECC in general unknown environments
that targets both spectrum efficiency and energy efficiency
without any deterministic channel model assumption.

III. Problem Formulation 
A. Cognitive Communication Model 

We first focus on EECC from the perspective of a single CR
user (also called a “secondary user” (SU)), which is distributed
or uncoordinated with other CR users. It consists of
a transmitter-receiver pair within the region of PC.
The transmitter sends data packets to the receiver synchronously
over time under the classic slotted model. The wireless
environment is highly dynamic: besides the
most influential PUA of $M$ PUs that affects
the qualities (states) and accessibility of the CC channels, there
are interference from other SU transceiver pairs, potential
jamming attacks, channel fading, etc., which make the
environment generally unknown. During each timeslot,
the SU transmitter selects $k$ channels to transmit data
to the receiver from a set $[K] = \{c_1, c_2, \ldots, c_f, \ldots, c_K\}$ of $K$
available orthogonal channels with possibly different data rates
across them. When a channel $c_f$ is occupied by a primary user
(PU), it is called busy; otherwise, it is called idle. However,
the busy (or idle) probability of the PU is unknown. There
is a set $[L] = \{S_1, S_2, \ldots, S_u, \ldots, S_L\}$ of $L$ SU transceiver
pairs contending or interfering with each
other. However, this behavior is transparent to a single SU
$S_u$. W.l.o.g., if there are adversarial events, we ascribe them
to one jammer who attacks the set or a subset
of the $K$ channels, and whose attacking strategies are unrevealed.
At each time $n$, the EE calculated from the allocated power
and received data rate of SU $u$ on channel $f$ is denoted
by $g_{n,u}(f)$, with $g_{n,u}(f) \in [0, M]$. We omit the subscript $u$ if there is
no confusion from context. Here the constant $M$ is the maximum
value of EE over all channels. W.l.o.g., we normalize $M = 1$
as usual in the regret analysis.

We employ the classic energy detection method 0 for
spectrum sensing, i.e., if the transmitting signal strength is above
a threshold, we regard the channel as busy or attacked, i.e.,
$g_n(f) = 0$. Otherwise, CC is allowed and $g_n(f)$ is released
for the frequency $f$ even though there are other potential PUs,
jammers, and CR users transmitting with low interfering power.
Thus, our model is suitable for both spectrum overlay and
spectrum underlay 0 schemes. We assume that each radio
out of the $k$ radios on the SU transmitter needs time $t_s$ for
sensing the status of a channel and time $t_p$ for probing its
quality. The actual time depends on the technology and device:
the typical value of $t_s$ is about 10 ms and $t_p$ ranges from 10 ms
to 133 ms 0. Let $t_{sp} = t_s + t_p$. When a channel $f \in [K]$ is
idle, the transmitter/receiver can access it for at most $t_a$ time,
so that it can detect the return of a PU. In practice, $t_a$
has the typical value of 2 s. Let $\mathbb{1}_n(f)$ be the indicator function
that denotes whether the SU decides to transmit data using the
probed channel.


B. Preliminary in EECC with Deterministic CSI 

Before discussing our own problem, let us first review
the classic EECC with deterministic channel accessibility and
states. For multi-channel CC, each SU only knows its own
payoff and strategy for each channel $f$ at timeslot $t$, i.e., the
realized transmission rate $r_t(f)$ and transmission power $P_t(f)$.
At each timeslot $t$, each SU chooses a subset of channels
over n according to some sensing/probing rules, where the
multi-channel access strategy is denoted by $i$ and we have
$f \in i \subset S$. The transmission power for each channel is
$P_t(i_f)$, $1 \le i_f \le k$, $1 \le i \le N$, and the total transmission
power over a strategy $i$ is $P_t(i) = \sum_{f=1}^{k} P_t(i_f)$.

Then, the instant data transmission rate $r_t(f)$ for the SU at
each selected channel $f$ is given as
r t (/) = IUlog 2 (l+tfSINR(P t )) = W • !*(/)• 
where gff(s{) and g/j(s J t ) are the respective channel gains
from itself and other SUs with instant channel states^/ and si,
b{ and f>( 1 are the respective interfering power and channel
gain from the PU l, a{ and zu( a are the respective interfering 
power and channel gain from the jammer J, and (cr/) 2 is the 
background noise power. The unit of r t (f) is nats/s. 

In traditional wireless communications, the EE of
multi-channel or OFDM (e.g., 0 ED) wireless systems with
$k$ subchannels (subcarriers) at timeslot $t$ is defined
as

$$EE_t = \frac{\sum_{f=1}^{k} r_t(i_f)}{P_c(i) + \sum_{f=1}^{k} P_t(i_f)} \ \text{nats/J}, \quad (1)$$

where $P_c(i)$ is the processing and circuit power consumption at
time $t$ while $P_t(i_f)$ is the transmission power for each
subchannel (subcarrier) $f$.
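
As a small illustration of definition (1), the sketch below evaluates the aggregate EE for given per-subchannel rates and transmission powers. The function name and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ofdm_energy_efficiency(rates, tx_powers, circuit_power):
    """Eq. (1): total rate over circuit power plus total transmit power."""
    if len(rates) != len(tx_powers):
        raise ValueError("one transmit power per subchannel is required")
    return sum(rates) / (circuit_power + sum(tx_powers))


# Two subchannels: 6 nats/s in total over 1 W circuit + 1 W transmit power.
ee = ofdm_energy_efficiency(rates=[2.0, 4.0], tx_powers=[0.5, 0.5],
                            circuit_power=1.0)
print(ee)
```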

By contrast, the definition of EE in EECC is slightly
different, because multi-channel CC is not restricted to a
pre-defined fixed set of OFDM (OFDMA) channels:
multi-radio based spectrum sensing and channel probing are
necessary to scan a large spectrum pool separately for a
group of (potentially) nonconsecutive and distributed channels
with the best channel sensing/probing qualities for general
CC systems 0. As such, the EE of EECC is measured
from the view of each sensed/probed transmitting channel
within the SPA scheme, i.e.,

$$EE_{t,f} = \frac{r_t(i_f)}{P_c(i_f) + P_t(i_f)} \ \text{nats/J}. \quad (2)$$

Then, the overall average EE for each SPA strategy $i$ is
given as

$$EE_t^{cc} = \frac{1}{k} \sum_{f=1}^{k} EE_{t,f} \ \text{nats/J}, \quad \forall f \in i. \quad (3)$$

Note that the sensing and probing energy consumption is
also categorized into $P_c(i_f)$, with $\sum_{f=1}^{k} P_c(i_f) = P_c(i)$ ²
and $\sum_{f=1}^{k} P_t(i_f) = P_t(i)$. A simple fact about the
relation of (1) and (3) is that $\max_f EE_{t,f} \ge EE_t^{cc} \ge
EE_t \ge \min_f EE_{t,f}$. When $EE_{t,f} = EE_{t,f'}$ for all $f \ne f'$,
$EE_t^{cc} = EE_t$. Thus, maximizing $EE_t^{cc}$ promotes the fairness of
EE among different channels. Incorporating sensing and
probing, and after determining the channel access strategy $i$, the
EECC can be formulated as the following nonlinear program:

$$\max \ EE_t^{cc} \quad \text{subject to} \quad P_t(i) \le P_0, \quad (4)$$

where each SU has a power budget $P_0$. By approaches similar to those
in 0, problem (4) is also quasi-concave with respect
to $P_t(i_f)$, and the water-filling method can be used to solve
it. Moreover, the definition of $EE_t^{cc}$ enables
information sharing of $EE_{t,f}$ for each channel among different
strategies, which is especially suitable for EECC design over
large spectrum pools.
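
The per-channel view of EE and the power budget constraint can be sketched as follows. This is only an illustration of the definitions, not a solver for the nonlinear program: the function names and the example values are assumptions, and the circuit power is taken as already divided among the channels.

```python
def per_channel_ee(rates, tx_powers, circuit_powers):
    """Per-channel EE: rate over circuit plus transmit power of that channel."""
    return [r / (pc + p) for r, p, pc in zip(rates, tx_powers, circuit_powers)]


def average_ee(rates, tx_powers, circuit_powers):
    """Strategy-level EE: arithmetic mean of the per-channel EE values."""
    ee = per_channel_ee(rates, tx_powers, circuit_powers)
    return sum(ee) / len(ee)


def within_budget(tx_powers, power_budget):
    """Budget constraint: total transmit power must not exceed P_0."""
    return sum(tx_powers) <= power_budget


rates, tx, circuit = [1.0, 3.0], [0.5, 0.5], [0.5, 0.5]
print(per_channel_ee(rates, tx, circuit), average_ee(rates, tx, circuit))
print(within_budget(tx, power_budget=1.0))
```

Because the per-channel values are kept separately, an estimate for one channel can be reused by every strategy that contains that channel.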


C. The Adaptive Online Learning for EECC: A Constrained 
Regret Minimization Formulation 

In reality, since no secret is shared and no adversarial event
is announced to the transceiver pair, multi-channel EECC
in unknown environments must sense/probe
and hop among different channels to dynamically access a
subset that maximizes its accumulated EE over time. Namely, this
sequential channel sensing/probing/accessing (SPA) problem
is to determine when to conduct channel hopping
(multi-channel access) and power allocation in a repeated game with the
environment, without knowing instant channel states, for a
CR transceiver pair so as to improve the EE of CC.
The difference between our SPA problem (based on MAB) and the
classic MAB problem is that, at every timeslot $t$, the classic
MAB receives a reward and repeats this for $T$ timeslots, while
in the SPA problem, at a timeslot $t$, we will not have any
gain if the CR user does not happen to use the channel for
data transmission after sensing/probing the channel.

²More precisely, we could divide the circuit and processing power among
the $k$ channels according to the bandwidth of each channel and calculate the
sensing and probing energy cost based on the monitoring of each channel.
Roughly speaking, we can simply do an energy-cost division among all
$k$ channels.

To address this issue, we only count the timeslots
spent for sensing/probing a chosen channel as a round (or still
say “timeslot”, if there is no confusion). The immediately following
timeslots spent for data transmission over a chosen multiple
channel set are not counted as a round. However, we
calculate and treat the averaged EE (3) from (2) based on
the previously transmitted data, the chosen transmission
power, and the known circuit and processing power $P_c(i_f)$ for
each sensed/probed channel $f$, where its gain is
$g_t(f) = \mathbb{1}_t(f) \cdot \frac{1}{Z_t}\int_0^{Z_t} EE_{t,f} \cdot (1 - \Pr_t(\mathrm{PU}, J, \mathrm{SU}))$, where $Z_t$ denotes
the time of the actual transmission and $\Pr_t(\mathrm{PU}, J, \mathrm{SU})$ is
the probability that the transmission will be destroyed by
the return of the PU, the jammer, or some other SUs within the $Z_t$
time duration. We set $t_a = Z_t$. Let $n$ be the number
of sensing/probing timeslots executed during the whole
system evolution duration $T$, which should satisfy the
condition $n \cdot t_{sp} + \sum_{t=1}^{n} Z_t \le T$. The first part is the
time spent for sensing/probing and the second part is the time
spent for multi-channel EECC.
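
The time budget condition can be checked numerically with the minimal sketch below. It assumes, for illustration only, that every sensing/probing round is followed by a transmission of full length $t_a$; the function names are hypothetical, and the example values are taken from the typical ranges quoted in Section III-A ($t_s \approx 10$ ms, $t_p$ within 10 ms to 133 ms, $t_a = 2$ s).

```python
import math


def max_rounds(total_time, t_sense, t_probe, t_access):
    """Largest n with n * (t_sp + t_access) <= total_time, assuming every
    round is followed by a transmission of full length t_access."""
    t_sp = t_sense + t_probe
    return math.floor(total_time / (t_sp + t_access))


def time_budget_ok(n, t_sp, transmit_times, total_time):
    """Check n * t_sp + sum of the first n transmission times <= T."""
    return n * t_sp + sum(transmit_times[:n]) <= total_time


# One hour of operation with t_s = 10 ms, t_p = 100 ms, t_a = 2 s.
print(max_rounds(total_time=3600, t_sense=0.01, t_probe=0.1, t_access=2.0))
```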

For the multi-channel accessing part, let us denote $\{0, 1\}^K$
as the vector space of all $K$ channels. The strategy space for
the transmitter is denoted as $S \subset \{0, 1\}^K$ of size $N = \binom{K}{k}$. If
the $f$-th channel is selected for transmitting data, the value of
the $f$-th entry of a vector (channel access strategy) is 1, and 0
otherwise. In the case of a jamming attack on
a subset of $k_j$ channels, the strategy space for the jammer is
denoted as $S_j \subset \{0, 1\}^K$ of size $\binom{K}{k_j}$. For convenience, we
say that the $f$-th channel is jammed if the value of the $f$-th entry
is 0, and not jammed if it is 1.
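
For small instances, the transmitter's strategy space can be enumerated explicitly, as in the sketch below. The helper name `channel_access_strategies` is illustrative; for realistic $K$ such an enumeration is impractical, since the number of strategies grows combinatorially.

```python
from itertools import combinations
from math import comb


def channel_access_strategies(num_channels, num_selected):
    """Enumerate all 0/1 vectors of length K with exactly k ones."""
    strategies = []
    for chosen in combinations(range(num_channels), num_selected):
        vector = [0] * num_channels
        for f in chosen:
            vector[f] = 1  # entry f is 1 when channel f is selected
        strategies.append(tuple(vector))
    return strategies


strategies = channel_access_strategies(num_channels=4, num_selected=2)
print(len(strategies), comb(4, 2))
```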

Formally, our MAB-based SPA problem is described as
follows: at each timeslot $n = 1, 2, 3, \ldots$, the transmitter (as
a decision maker) selects a strategy $I_n$ from $S_r$ with a power
strategy $P_t(I_f), \forall f \in I_n$. The cardinality of $S_r$ is $|S_r| = N$.
The reward $g_n(f)$ is assigned to each channel $f \in \{1, \ldots, K\}$,
and the SU only gets rewards in strategy $i \in S_r$. The total
reward of a strategy $i$ in timeslot $n$ is $g_n(i) = \sum_{f \in i} g_n(f)$.
Then, on the one hand, the cumulative reward (or EE) up to
timeslot $n$ of the strategy $i$ is
$G_{n,i} = EE_n^{cc}(i) = \sum_{t=1}^{n} g_t(i) = \sum_{t=1}^{n} \sum_{f \in i} g_t(f)$.
On the other hand, the total reward over
all the strategies chosen by the transmitter up to timeslot $n$
is $\hat{G}_n = \widehat{EE}_n^{cc} = \sum_{t=1}^{n} g_t(I_t) = \sum_{t=1}^{n} \sum_{f \in I_t} g_t(f)$,
where the strategy $I_t$ is chosen randomly according to some
distribution over $S_r$. The performance of this algorithm is
quantified by the regret $R(n)$, defined as the difference between the
expected reward of the best fixed solution in hindsight and the
expected reward collected by our proposed algorithm
up to $n$, i.e.,

$$R(n) = \max_{i \in S_r} \mathbb{E}_{i \sim p_n}\{G_{n,i}\} - \mathbb{E}[\hat{G}_n], \quad (5)$$

where $p_n$ is the decision probability vector over all
strategies and the maximum is taken over all available strategies.
However, if we use the gain (reward) model, we will face
technical difficulties as presented in [20] (pages 25-28). Thus,
we introduce the loss model by the simple trick of
$\ell_n(f) = 1 - g_n(f)$ for each channel $f$ and $\ell_n(i) = k - g_n(i)$
for each strategy to avoid this issue. Then, we have $L_n(i) =
nk - G_{n,i}$, where $L_n(i) = \sum_{t=1}^{n} \ell_t(i) = \sum_{t=1}^{n} \sum_{f \in i} \ell_t(f)$,
and similarly, we have $\hat{L}_n = nk - \hat{G}_n$. Using $\mathbb{E}_n[\cdot]$ to denote
expectations on realization of all strategies as random variables
up to round $n$, the expected regret $R(n)$ can be rewritten as

$$R(n) = \mathbb{E}[\hat{L}_n] - \min_{i \in S_r} \mathbb{E}_{i \sim p_n}\{L_n(i)\}
= \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{E}_t\Big[\sum_{f \in I_t} \ell_t(f)\Big]\Big]
- \min_{i \in S_r}\Big(\mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{E}_t\Big[\sum_{f \in i} \ell_t(f)\Big]\Big]\Big). \quad (6)$$

The goal of the algorithm is to minimize the weak regret [20],
or simply the regret. For AOEECC, in addition to rewards,
there are power budget constraints on the decision of
transmission power $P_t(f)$ that need to be satisfied. Particularly, for
the decision $p_n$ made by the learner for each channel access
strategy, the power budget constraint can be written as

$$p_n^{\top} \mathbf{P} \le P_0. \quad (7)$$

Note that the SUs of the CC need to make decisions $p_n$
that attain the maximal cumulative reward while satisfying the
additional constraints (7).

Within our setting, we refer to this problem as the
constrained regret minimization problem. More precisely, let
$\mathbf{P} = \{P(1), \ldots, P(N)\}$ be the constraint vector defined over
power allocation actions. In the stochastic setting, the vector $\mathbf{P}$
is not predetermined and is unknown to the learner. In each
timeslot $t$, beyond the reward feedback, the SU receives a
random realization $\mathbf{P}_t = \{P_t(1), \ldots, P_t(i), \ldots, P_t(N)\}$ of $\mathbf{P}$,
where $\mathbb{E}[P_t(i)] = P(i)$. W.l.o.g., we assume $\mathbf{P}_t \in [0,1]^N$ and
$P_0 \in [0,1]$. Formally, the goal is to attain a gradually vanishing
constrained regret as

$$\mathrm{Regret}_n = R(n)\big|_{p^{\top}\mathbf{P} \le P_0} \le O(n^{1-\beta_1}). \quad (8)$$

Furthermore, the decisions $p_n$ made by the learner are required
to attain a sublinear bound on the violation of the constraint in
the long run, i.e.,

$$\mathrm{Violation}_n = \sum_{t=1}^{n} (p_t^{\top}\mathbf{P} - P_0) \le O(n^{1-\beta_2}). \quad (9)$$

In contrast to the short-term constraints, where constraint (7)
is required to be satisfied at every timeslot, SUs are allowed
to violate the constraints for some rounds in a controlled way;
but the constraints must hold on average over all rounds, i.e.,
$(\sum_{t=1}^{n} p_t^{\top}\mathbf{P})/n \le P_0$.

Fig. 1: Multi-channel EECC in Different Regimes
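
The bookkeeping behind the loss transformation, the cumulative quantities, and the long-term power constraint can be sketched on a single sample path as follows. This is an illustrative sketch under simplifying assumptions: it computes the realized (not expected) regret, strategies are represented as sets of channel indices, and all function names and numbers are hypothetical.

```python
def to_loss(gain):
    """Loss trick: loss = 1 - gain, for a gain normalized to [0, 1]."""
    return 1.0 - gain


def cumulative_gain(gains, strategy):
    """Sum over rounds of the gains of the channels in a fixed strategy.

    gains[t][f] is the gain of channel f in round t."""
    return sum(sum(g[f] for f in strategy) for g in gains)


def realized_regret(gains, strategies, played):
    """Best fixed strategy in hindsight minus the gain actually collected."""
    best = max(cumulative_gain(gains, s) for s in strategies)
    collected = sum(sum(g[f] for f in s) for g, s in zip(gains, played))
    return best - collected


def long_term_violation(decisions, power_vector, budget):
    """Sum over rounds of (p_t . P - P_0) for decision vectors p_t."""
    return sum(
        sum(p * c for p, c in zip(p_t, power_vector)) - budget
        for p_t in decisions
    )


gains = [[1.0, 0.0, 0.5], [1.0, 0.0, 0.5]]   # n = 2 rounds, K = 3 channels
strategies = [{0, 1}, {0, 2}, {1, 2}]        # k = 2 channels per strategy
print(realized_regret(gains, strategies, played=[{1, 2}, {0, 2}]))
```

In the violation helper, a decision that exceeds the budget in one round can be compensated by a more conservative decision in another round, which is exactly the freedom that the long-term constraint grants over the per-timeslot constraint.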


D. The Four Regimes of Wireless Environments 

Since our algorithm does not need to know the nature of the
environment, different features of the environment
will affect its performance. We categorize them into
the four typical regimes shown in Fig. 1.

1) Adversarial Regime: In this regime, there is a jammer
sending interfering power or injecting garbage data packets
over all $K$ channels such that the transceiver’s channel rewards
suffer completely from an unrestricted jammer (see Fig. 1
(a)). Usually, the EE will be significantly reduced in the
adversarial regime. Note that, as in the classic model of the
well-known non-stochastic MAB problem ED, the adversarial
regime implies that the jammer often launches attacks in every
timeslot. It is the most general setting, and the other three regimes
can be regarded as special cases of the adversarial regime.

Attack Model: Different attack philosophies lead to
different levels of effectiveness. We focus on the following two
types of jammers in the adversarial regime:

a) Oblivious attacker: an oblivious attacker attacks
different channels with different attacking strengths, resulting in
different EE reductions, independently of the past
communication records it might have observed.

b) Adaptive attacker: an adaptive attacker selects its
attacking strength on the targeted (sub)set of channels by
utilizing its past experience and observation of the previous
communication records. It is very powerful and can infer the
SPA protocol and attack with different levels of strength over
a subset of channels during a single timeslot based on the
historical monitoring records. As shown in a recent work, no
bandit algorithm can guarantee a sublinear regret $o(t)$ against
an adaptive adversary with unbounded memory, because the
adaptive adversary can mimic the behavior of the SPA protocol
to attack, which leads to a linear regret (the attack cannot be
defended against). Therefore, we consider a more practical
$\theta$-memory-bounded adaptive adversary [29] model. It is an adversary
constrained to loss functions that depend only on the $\theta + 1$
most recent strategies.

2) Stochastic Regime: In this regime, the SU’s transceiver
communicates over $K$ stochastic channels within PC, as
shown in Fig. 1 (b). The channel losses $\ell_n(f), \forall f \in 1, \ldots, K$
(obtained by transferring the reward to loss, $\ell_n(f) = 1 -
g_n(f)$) of each channel $f$ are sampled independently from an
unknown distribution that depends on $f$, but not on $n$. We use
$\mu(f) = \mathbb{E}[\ell_n(f)]$ to denote the expected loss of channel $f$. We
define channel $f$ as a best channel if $\mu(f) = \min_{f'}\{\mu(f')\}$
and as a suboptimal channel otherwise; let $f^*$ denote some best
channel. Similarly, for each strategy $i \in S_r$, we define $i$ as a best
strategy if $\mu(i) = \min_{i'}\{\sum_{f \in i'} \mu(f)\}$ and as a suboptimal
strategy otherwise; let $i^*$ denote some best strategy. For
each channel $f$, we define the gap $\Delta(f) = \mu(f) - \mu(f^*)$;
let $\Delta_f = \min_{f: \Delta(f) > 0}\{\Delta(f)\}$ denote the minimal gap of
channels. Letting $N_n(f)$ be the number of times channel $f$ was
played up to time $n$, the regret can be rewritten as

$$R(n) = \sum_{f} \mathbb{E}[N_n(f)]\Delta(f). \quad (10)$$

Note that we can calculate the regret either from the
perspective of channels $f \in 1, \ldots, K$ or from the perspective of
strategies $i \in S_r$. However, because the set of strategies is of
size $\binom{K}{k}$, which grows exponentially with respect to $K$, and
this view does not exploit the channel dependency among different
strategies, we calculate the regret from channels, where
tight regret bounds are achievable.
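
The gap-based form of the regret can be illustrated with the following sketch. The mean losses and play counts are made-up numbers, and the function names are hypothetical; the code only evaluates the definitions of the gaps and the per-channel regret decomposition.

```python
def channel_gaps(mean_losses):
    """Gap of each channel: its mean loss minus the smallest mean loss."""
    best = min(mean_losses)
    return [mu - best for mu in mean_losses]


def minimal_gap(mean_losses):
    """Smallest strictly positive gap among the channels."""
    return min(gap for gap in channel_gaps(mean_losses) if gap > 0)


def regret_from_counts(expected_counts, mean_losses):
    """Regret as the sum over channels of expected plays times gap."""
    gaps = channel_gaps(mean_losses)
    return sum(n_f * gap for n_f, gap in zip(expected_counts, gaps))


mean_losses = [0.25, 0.5, 0.75]   # channel 0 is the best channel
expected_counts = [90, 8, 2]      # plays of each channel out of 100 rounds
print(regret_from_counts(expected_counts, mean_losses))
```

Only plays of suboptimal channels contribute, so a learner that quickly concentrates on the best channel keeps this sum small.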

3) Mixed Adversarial and Stochastic Regime: This regime
assumes that the jammer only attacks $k_j$ out of the $k$ currently
chosen channels at each timeslot, as shown in Fig. 1 (c). There
is always a $k_j/k$ portion of channels under adversarial attack,
while the other $(k - k_j)/k$ portion is stochastically distributed.

Attack Model: We consider the same attack model as in
the adversarial regime. The difference here is that the jammer
only attacks a subset of size $k_j$ of the total $k$ channels.

4) Contaminated Stochastic Regime: The definition of this
regime comes from many practical observations that only a
few channels and timeslots are exposed to the jammer or
other disturbing events in CC. In this regime, the oblivious
jammer selects some slot-channel pairs $(t, f)$ as "locations"
to attack, while the remaining channel weights are generated
in the same way as in the stochastic regime. We define the attacking
strength parameter $\zeta \in [0, 1/2)$. After a certain number $\tau$ of timeslots, for
all $t > \tau$ the total number of contaminated locations of each
suboptimal channel up to time $t$ is $t\Delta(f)\zeta$, and the number of
contaminated locations of each best channel is $t\Delta_f\zeta$. We call a
contaminated stochastic regime moderately contaminated if $\zeta$
is at most 1/4; in this case we can prove that, for all $t > \tau$, on average
over the stochasticity of the loss sequence the attacker can
reduce the gap of every channel by at most one half.
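
The one-half claim can be sketched as follows, writing $\zeta$ for the attacking strength parameter and $\tau$ for the time after which the contamination counts hold. The sketch assumes (our assumption, not stated in the definition above) that each contaminated location changes a loss in $[0,1]$ by at most 1.

```latex
\begin{align*}
\text{shift of the average loss of a suboptimal channel } f
  &\le \frac{t\,\Delta(f)\,\zeta}{t} = \zeta\,\Delta(f),\\
\text{shift of the average loss of a best channel}
  &\le \frac{t\,\Delta_f\,\zeta}{t} = \zeta\,\Delta_f \le \zeta\,\Delta(f),\\
\text{contaminated gap of } f
  &\ge \Delta(f) - 2\zeta\,\Delta(f) = (1 - 2\zeta)\,\Delta(f)
   \ge \tfrac{1}{2}\,\Delta(f) \quad \text{for } \zeta \le \tfrac{1}{4}.
\end{align*}
```

The second line uses $\Delta_f \le \Delta(f)$ for every suboptimal channel, since $\Delta_f$ is the minimal positive gap.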

IV. The AOEECC Algorithm 

In this section, we focus on developing an AOEECC algorithm
for the SU. The design philosophy is that the transmitter
collects and learns the rewards of the previously chosen channels,
based on which it decides the channel access strategy for the
next timeslot, i.e., the SU decides whether to transmit
data over the current channel set (called exploitation) or to
continue sensing/probing some other channels for access
(called exploration).

Algorithm 1, namely AOEECC-EXP3++, is
a combinatorial variant of the EXP3 algorithm. Before we
present the algorithm, let us introduce the following vectors: $\tilde{\ell}_t$
is the all-zero vector except in the $I_t$-th channel access strategy, and


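Algorithm 1 AOEECC-EXP3++: An $\epsilon$-SPA Scheme for multi-channel EECC

Input: $K, k, n$. See text for the definition of $\eta_n$, $\xi_n(f)$, and $\delta_n$.

Initialization: Set the initial strategy and channel losses $\forall i \in [N], \tilde{L}_0(i) = 0$
and $\forall f \in [K], \tilde{\ell}_0(f) = 0$, respectively; then set
the initial strategy and channel weights $\forall i \in [N], \bar{w}_0(i) = k$
and $\forall f \in [K], w_0(f) = 1$, respectively. The initial total
strategy weight is $W_0 = N = \binom{K}{k}$.

Set: $\varepsilon_n(f) = \min\{\frac{1}{2K}, \beta_n, \xi_n(f)\}, \forall f \in [K]$,
and $\gamma_n = \sum_{f=1}^{K} \varepsilon_n(f)$.

for timeslot n = 1, 2, ... do

1: Based on the sensing and probing results, randomly select
a channel access strategy $I_n$ according to the strategy
probabilities $p_n(i), \forall i \in [N]$, with $p_n(i)$ computed as
follows:

$$p_n(i) = \begin{cases} (1 - \gamma_n)\dfrac{\bar{w}_{n-1}(i)}{W_{n-1}} + \sum_{f \in i} \varepsilon_n(f) & \text{if } i \in C \\[2mm] (1 - \gamma_n)\dfrac{\bar{w}_{n-1}(i)}{W_{n-1}} & \text{if } i \notin C \end{cases} \quad (11)$$

2: Compute the probability $\rho_n(f), \forall f \in [K]$,

$$\rho_n(f) = \sum_{i : f \in i} p_n(i) = (1 - \gamma_n)\frac{\sum_{i : f \in i} \bar{w}_{n-1}(i)}{W_{n-1}} + \sum_{f \in i} \varepsilon_n(f)\,|\{i \in C : f \in i\}|. \quad (12)$$

3: Sense and probe the channels $f \in I_n$. Receive the
scaled (i.e., in the range [0,1]) loss (converted from the
reward model) and power, and then calculate the
EE loss $\ell_{n-1}(f)$ for channel $f$ and the realization of the power
budget $P_{n-1}(f), \forall f \in I_n$. Update the estimated loss with the
augmented power allocation constraint $\tilde{\psi}_n(f), \forall f \in [K]$,
as follows:

$$\tilde{\psi}_{n-1}(f) = \begin{cases} \tilde{\ell}_{n-1}(f) + \lambda_{n-1}\tilde{P}_{n-1}(f) & \text{if } f \in I_n \\ 0 & \text{otherwise.} \end{cases} \quad (13)$$

4: Update the Lagrangian multiplier by the following
equation:

$$\lambda_n = \left[(1 - \delta_{n-1}\eta_{n-1}\sqrt{\gamma_{n-1}})\lambda_{n-1} - \eta_{n-1}\sqrt{\gamma_{n-1}}\,(P_0 - p_{n-1}^{\top} P_{n-1})\right]_+ .$$

5: The receiver updates all the weights as

$$w_n(f) = w_{n-1}(f)\, e^{-\eta_n \tilde{\psi}_n(f)},$$

$$\bar{w}_n(i) = \prod_{f \in i} w_n(f) = \bar{w}_{n-1}(i)\, e^{-\eta_n \tilde{\phi}_n(i)}.$$

The sum of the weights of all strategies is $W_n = \sum_{i \in S} \bar{w}_n(i)$.

6: Access each of the channels $f \in I_n$ with probability $\epsilon$,
i.e., set $\mathbb{1}_n(f) = 1$ with probability $\epsilon$.

end for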


so is the channel loss $\tilde{\ell}_t(f)$ within $I_t$: $\forall f \in I_t$, we have
$\tilde{\ell}_t(f) = \ell_t(f)/\rho_t(f)$. Similarly, $\tilde{P}_t$ is the all-zero vector except
in the $I_t$-th channel access strategy, and so is the power $\tilde{P}_t(f)$
within $I_t$: $\forall f \in I_t$, we have $\tilde{P}_t(f) = P_t(f)/\rho_t(f)$.
It is easy to verify that $\mathbb{E}_{I_t}[\tilde{\ell}_t(f)] = \ell_t(f)$ and $\mathbb{E}_{I_t}[\tilde{P}_t(f)] = P_t(f)$,
where $\rho_n = (\rho_n(1), \rho_n(2), \ldots, \rho_n(K))$ and $p_n = (p_n(1), \ldots, p_n(N))$.
In addition, we have the following equalities at step 5 of Algorithm 1.

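$$\tilde{\Psi}_n(f) = \tilde{\Psi}_{n-1}(f) + \tilde{\psi}_{n-1}(f), \quad \tilde{\Phi}_n(i) = \tilde{\Phi}_{n-1}(i) + \tilde{\phi}_{n-1}(i),$$

$$\tilde{\Psi}_n(f) = \tilde{L}_{n-1}(f) + \lambda_{n-1}\tilde{\Gamma}_{n-1}(f),$$

$$\tilde{\Phi}_n(i) = \tilde{L}_{n-1}(i) + \lambda_{n-1}\tilde{\Gamma}_{n-1}(i),$$

$$\tilde{L}_n(f) = \tilde{L}_{n-1}(f) + \tilde{\ell}_{n-1}(f), \quad \tilde{L}_n(i) = \tilde{L}_{n-1}(i) + \tilde{\ell}_{n-1}(i),$$

$$\tilde{\Gamma}_n(f) = \tilde{\Gamma}_{n-1}(f) + \tilde{P}_{n-1}(f), \quad \tilde{\Gamma}_n(i) = \tilde{\Gamma}_{n-1}(i) + \tilde{P}_{n-1}(i),$$

$$\tilde{\ell}_{n-1}(i) = \sum_{f \in i} \tilde{\ell}_{n-1}(f), \quad \tilde{P}_{n-1}(i) = \sum_{f \in i} \tilde{P}_{n-1}(f),$$

where $\tilde{L}_n(f)$ and $\tilde{\Gamma}_n(f)$ are the respective accumulated estimated
loss and allocated power on channel $f$ up to round $n$,
and $\tilde{L}_n(i)$ and $\tilde{\Gamma}_n(i)$ are the respective accumulated estimated
loss and allocated power on strategy $i$ up to round $n$. Moreover,
the exploration probability is decomposed over the
channels, where $\gamma_n = \sum_{f=1}^{K} \varepsilon_n(f)$.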

Our new algorithm uses the fact that when losses (converted 
from rewards) and power of channels in the chosen strategy 
are revealed, it also shares this information with the common 
channels of the other chosen strategies. During each timeslot, 
we assign a channel weight that is dynamically adjusted based 
on the channel losses revealed. The weight of a strategy is 
determined by the product of weights of all channels. Our 
algorithm has two control levers: the learning rate $\eta_n$ and the
exploration parameters $\xi_n(f)$ for each channel $f$. To facilitate
adaptive channel access toward optimal solutions without
knowledge of the nature of the environment, the crucial
innovation is the introduction of the exploration parameters $\xi_n(f)$,
which are tuned individually for each arm depending on the
past observations.
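
The weight bookkeeping described above can be sketched in a few lines. This is a minimal illustration, assuming exponential channel weights driven by estimated losses and a strategy weight equal to the product of its channel weights; the function names and numbers are hypothetical.

```python
import math


def update_channel_weights(weights, est_losses, eta):
    """Multiply each channel weight by exp(-eta * estimated loss)."""
    return [w * math.exp(-eta * loss) for w, loss in zip(weights, est_losses)]


def strategy_weight(weights, strategy):
    """Weight of a strategy = product of the weights of its channels."""
    return math.prod(weights[f] for f in strategy)


# Hypothetical example with K = 4 channels and a strategy of k = 2 channels.
eta = 0.1
est_losses = [0.0, 2.0, 0.5, 1.0]
weights = update_channel_weights([1.0] * 4, est_losses, eta)
w_strategy = strategy_weight(weights, (1, 3))
```

Because the strategy weight is a product of exponentials, it equals the exponential of the summed channel losses, so updating $K$ channel weights implicitly updates all $\binom{K}{k}$ strategy weights.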

A set of covering strategies $C$ is defined to ensure that each
channel is sampled sufficiently often. It has the property that,
for each channel $f$, there is a strategy $i \in C$ such that $f \in i$.
Since there are only $K$ channels and each strategy includes $k$
channels, we have $|C| = \lceil K/k \rceil$. The value $\sum_{f \in i} \varepsilon_n(f)$ is
the randomized exploration probability for each strategy $i \in C$,
which is the sum of the exploration probabilities $\varepsilon_n(f)$ of the
channels $f$ that belong to strategy $i$. The introduction
of $\sum_{f \in i} \varepsilon_n(f)$ ensures that $p_n(i) \ge \sum_{f \in i} \varepsilon_n(f)$, so that $p_n$ is
a mixture of the exponentially weighted average distribution and
a uniform distribution over each strategy.
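
A minimal sketch of this mixture is given below. It assumes, for simplicity, that $K$ is a multiple of $k$ so the covering set can be a partition of the channels into $K/k$ blocks; the helper names and the per-channel exploration values are hypothetical.

```python
import math
from itertools import combinations


def covering_set(K, k):
    """Partition channels 0..K-1 into K/k consecutive blocks (assumes k divides K)."""
    assert K % k == 0, "this sketch assumes K is a multiple of k"
    return [tuple(range(s, s + k)) for s in range(0, K, k)]


def strategy_probabilities(weights, eps, k):
    """Mix the exponential-weights distribution with exploration on the covering set."""
    K = len(weights)
    cover = set(covering_set(K, k))
    strategies = list(combinations(range(K), k))
    sw = {i: math.prod(weights[f] for f in i) for i in strategies}
    total = sum(sw.values())
    gamma = sum(eps)  # total exploration probability
    probs = {}
    for i in strategies:
        probs[i] = (1.0 - gamma) * sw[i] / total
        if i in cover:
            probs[i] += sum(eps[f] for f in i)  # extra mass only on covering strategies
    return probs


# Hypothetical example: K = 6 channels, k = 2 channels per strategy.
weights = [1.0, 0.5, 0.8, 0.2, 0.9, 0.4]
eps = [0.01, 0.02, 0.01, 0.03, 0.01, 0.02]
probs = strategy_probabilities(weights, eps, k=2)
```

With a partition as the covering set, the exploration mass adds up to $\gamma_n$, so the probabilities sum to one and every covering strategy keeps at least its own exploration probability.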

In the following discussion, to facilitate the AOEECC-EXP3++
algorithm without knowledge of the nature of the
environment, we can apply the two control parameters
simultaneously by setting $\eta_n = \beta_n$, $2kK\eta_n \le \delta_n = O(\eta_n)$, and
use the control parameter $\xi_n(f)$ such that it achieves the
optimal "root-n" regret in the adversarial regime and almost
optimal "logarithmic-n" regret in the stochastic regime.
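
The Lagrangian multiplier update in step 4 of Algorithm 1 can be sketched as a projected step. The sketch follows the update as we read it from the algorithm listing, namely $\lambda_n = [(1 - \delta\eta\sqrt{\gamma})\lambda_{n-1} - \eta\sqrt{\gamma}(P_0 - p^{\top}P)]_+$; the function and argument names are hypothetical.

```python
import math


def update_multiplier(lam, eta, gamma, delta, power_used, power_budget):
    """One projected update of the Lagrangian multiplier (clipped at zero)."""
    step = eta * math.sqrt(gamma)
    candidate = (1.0 - delta * step) * lam - step * (power_budget - power_used)
    return max(0.0, candidate)


# Hypothetical example: the multiplier grows only when the budget is exceeded.
lam_over = update_multiplier(0.0, eta=0.1, gamma=0.04, delta=0.5,
                             power_used=1.5, power_budget=1.0)
lam_under = update_multiplier(0.0, eta=0.1, gamma=0.04, delta=0.5,
                              power_used=0.5, power_budget=1.0)
```

A larger multiplier inflates the augmented loss of power-hungry channels, which is how the long-term power budget is enforced without a hard per-slot constraint.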

V. Performance Results of EECC under e-SPA 

This section analyzes the regret and power budget violation 
performance of our proposed AOEECC-EXP3++ algorithm in 
different regimes. 

A. Adversarial Regime 

We first show that by tuning $\eta_n$ and $\xi_n(f)$ together, we can
get the optimal regret (of reward and violation) of AOEECC-EXP3++
in the adversarial regime, which is a general result
that holds for all other regimes. Define $G_n(\epsilon)$ as the expected
average EE that can be achieved by the $\epsilon$-SPA scheme over
$n$ rounds. Theorem 1, Theorem 3, Theorem 5, Theorem
7, Theorem 9 and Theorem 11 bound the regret of EE,
$\max_{i \in S_r} \mathbb{E}_{i \sim p_n}\{G_{n,i}\} - \mathbb{E}[G_n(1)]$, when $\epsilon = 1$.

Theorem 1. Under the oblivious jamming attack, no matter 
how the status of the channels change (potent ially in an 
adversarial manner), for rj n = (3 n , S n = 2 k^J and 

any £ n (f) = 0(l/n), the regret of the AOEECC-EXP3++ 
algorithm for any n satisfies: 

Regret n = i?(n) pTp < 4 ks/nK In K = 0(n 1//2 ). 

Violation = EEtLiPt P t ~ Po]+ < 0(ni). 

From Theorem 1, we find that the regret is optimal in both order and
leading factor when compared to the results in
anti-jamming wireless communications [25]. For the power
budget violation, we have a sublinear bound of $O(n^{3/4})$. From
the proof of the theorem, this upper bound may be very loose.

According to the $\epsilon$-SPA scheme, the CR user will transmit
$\epsilon n$ times in expectation during $n$ rounds. It is easy to
show that $\mathbb{E}[G_n(\epsilon)] = \epsilon\,\mathbb{E}[G_n(1)]$, which implies
$\mathbb{E}[G_n(\epsilon)] \ge \epsilon \max_{i \in S_r} \mathbb{E}[G_{n,i}] - 4\epsilon k\sqrt{nK \ln K}$.
Let $G_{\max}$ be the largest
expected data rate of channel access strategies among all
the strategies. We have $\max_{i \in S_r} \mathbb{E}_{i \sim p_n}\{G_{n,i}\} = n \cdot G_{\max}$.
Assume t a = at sp where constant a 1. Then we have 
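Theorem 2. The expected EE of the $\epsilon$-SPA scheme of
AOEECC-EXP3++ in the adversarial regime under the oblivious
jammer is at least

$$\frac{G_n(\epsilon)\, t_a}{T} \ge \frac{G_{\max} - 4k\sqrt{\frac{K \ln K}{n}}}{\frac{1}{\alpha\epsilon} + 1} = \frac{G_{\max} - 4k\sqrt{\frac{(1 + \alpha\epsilon)\, t_{sp} K \ln K}{T}}}{\frac{1}{\alpha\epsilon} + 1},$$

where $T = n t_{sp} + \epsilon n t_a$.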

We find that when $T$ is sufficiently large, the achievable
expected EE is at least $\frac{G_{\max}}{\frac{1}{\alpha\epsilon} + 1}$, which is maximized when
$\epsilon = 1$. Obviously, the expected EE that can be achieved is no
more than $\frac{G_{\max}}{\frac{t_{sp}}{t_a} + 1}$, because each transmission takes
at least $t_{sp} + t_a$ time while the expected EE is no more than
$G_{\max}$. Thus, when $T$ is sufficiently large, the $\epsilon$-SPA scheme
of AOEECC-EXP3++ is almost optimal. Similar conclusions
also hold for the following Theorem 4, Theorem 6, Theorem
8, Theorem 10, Theorem 12 and Theorem 14.
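
To see how the guarantee behaves as $T$ grows, the sketch below evaluates a lower bound of the form $\left(G_{\max} - 4k\sqrt{(1+\alpha\epsilon)\,t_{sp} K \ln K / T}\right) / \left(\frac{1}{\alpha\epsilon} + 1\right)$, which is our reading of Theorem 2 obtained by substituting $n = T / ((1+\alpha\epsilon)t_{sp})$ into the regret bound of Theorem 1. The function name and parameter values are hypothetical.

```python
import math


def ee_lower_bound(g_max, k, K, alpha, eps, t_sp, T):
    """Lower bound on G_n(eps) * t_a / T, as read from Theorem 2 (an assumption)."""
    penalty = 4 * k * math.sqrt((1 + alpha * eps) * t_sp * K * math.log(K) / T)
    return (g_max - penalty) / (1.0 / (alpha * eps) + 1.0)


# Hypothetical parameters: the learning penalty vanishes like 1/sqrt(T).
params = dict(g_max=1.0, k=2, K=8, alpha=4.0, eps=1.0, t_sp=1.0)
bound_small_T = ee_lower_bound(T=1e6, **params)
bound_large_T = ee_lower_bound(T=1e12, **params)
limit = params["g_max"] / (1.0 / (params["alpha"] * params["eps"]) + 1.0)
```

The gap between the bound and its limit shrinks as $T$ increases, which is the sense in which the scheme is almost optimal for sufficiently large $T$.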

Theorem 3. Under the O-memory-bounded adaptive jam¬ 
ming attack, for r\ n = /3 n , S n = and any 

£ n (/) > 0, the regret of the AOEECC-EXP3++ algorithm 
for any n is upper bounded by: 

Regret n = i7(n) pTp < O((0 + 1)(4 k\/K In K)ini). 

Violation = EE^ =1 p^P t - P G ] + < 0{ni). 

Theorem 4. The expected EE of e-SPA scheme of 
AOEECC-EXP3++ in the adversarial regime under the 0- 
memory-bounded adaptive jammer is at least 

Gn{y)^a ^ Gmax 


-(^ + l)(4fcv / j ; nnXr) 5 ((l+aeT P )s 


± + l 

ae 

sp t ent a . With sufficiently large T, our e-SPA 


T 

where T = nt, 
scheme of AOEECC-EXP3++ is almost optimal. 


B. Stochastic Regime 

We consider several ways of tuning the exploration
parameters $\xi_n(f)$ for different practical implementation
considerations, which lead to different regret performance
of AOEECC-EXP3++. We begin with the idealistic assumption
that the gaps $\Delta(f), \forall f \in K$, are known in Theorem 5, just to
give an idea of the best result we can have and of the
general idea behind all our proofs.

Theorem 5. Assume that the gaps A(/),V/ G AT, are 
known. Let n* be the minimal integer that satisfy n*(f) > 
4c . p or an y choice of rj n = /3 n and any 

c > 18, the regret of the AOEECC-EXP3++ algorithm with 

£ n (a) = C lU nA(fP ~ an< ^ = in the stochastic 

regime satisfies: 

Regret n = R(n ) plp < 0(kK c -^f) (15) 

Violation,, = E[]T" =1 p^Pt - P c ] + < 0(ni). 

From the upper bound results, we note that the leading
constants $k$ and $K$ are optimal and tight, as indicated for the
CombUCB1 algorithm [37]. However, our regret is worse by a factor
of $\ln(n)$ than the optimal "logarithmic"
regret as in [32], [37], where the performance gap is
negligible (see the numerical results in Section IX).

Theorem 6. The expected EE of e-SPA scheme of 
AOEECC-EXP3++ in the stochastic regime is at least 

r< ZckK In ((1 +ae)t sp /T) 2 

^max 

> - 


Gn ( e )^a 


Af(\-\-ae)t sp /T 


ae 


1 


where T = nt sp + ent a . With sufficiently large T, our e-SPA 
scheme of AOEECC-EXP3++ is almost optimal. 

1) A Practical Implementation by Estimating the Gap:
The gaps $\Delta(f), \forall f \in K$, cannot be known
in advance before running the algorithm. Next, we
show a more practical result that uses the empirical gap as
an estimate of the true gap. The estimation process can be
performed in the background for each channel $f$ from
the start of the algorithm, i.e.,

$$\hat{\Delta}_n(f) = \min\left\{1, \frac{1}{n}\left(\tilde{L}_n(f) - \min_{f'} \tilde{L}_n(f')\right)\right\}. \quad (15)$$

This is the first algorithm that can be used in many real-world
applications.
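
The empirical gap can be computed directly from the accumulated estimated losses. The sketch below assumes the estimate has the form $\min\{1, \frac{1}{n}(\tilde{L}_n(f) - \min_{f'}\tilde{L}_n(f'))\}$, i.e., the average excess loss over the empirically best channel, clipped at 1; the function name and numbers are hypothetical.

```python
def empirical_gaps(cum_est_losses, n):
    """Per-channel gap estimate: average excess loss over the best channel, clipped at 1."""
    best = min(cum_est_losses)
    return [min(1.0, (loss - best) / n) for loss in cum_est_losses]


# Hypothetical accumulated estimated losses after n = 100 rounds.
gaps = empirical_gaps([10.0, 25.0, 300.0], n=100)
```

The clipping matters because importance-weighted loss estimates can exceed the range of the true losses.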

Theorem 7. Let c > 18 and r] n = Pn- Let n* be the minimal 
integer that satisfies n* > 4c K , and let n*(f) = 

max{n*, |"e 1 / A ^) 2 ~|} and n* = max{f eK yn*(f). The regret 
of the AOEECC-EXP3++ algorithm with £„(/) = w ? (ln "A 


and Sr, = 2 k 


K\nK 


termed as AOEECC-EXP3++ 


AVG 


in 


the stochastic regime satisfies: 


c In (n) 3 


Regret n = R(n) p y P <0(kK 


) 


(16) 


Violation™ = EE" =1 pjP t - P c ]+ < 0(ni 
From the theorem, we see that in this more practical case the regret
is worse by another factor of $\ln(n)$ when compared
to the idealistic case for EECC.

Theorem 8. The expected EE of e-SPA scheme of 
AOEECC-EXP3 ++ avg in the stochastic regime under the 
oblivious jammer is at least 

^ 2ckK In ((l+ae)t sp /T) 3 

^max 

> - 


Gn (^)^a 


A f (l+ae)t sp /T 


ae 


1 


T 

where T = nt sp + ent a . With sufficiently large T, our e-SPA 
scheme of AOEECC-EXP3++ is almost optimal. 


C. Mixed Adversarial and Stochastic Regime 

The mixed adversarial and stochastic regime can be regarded
as a special case of mixing adversarial and stochastic
regimes. Since there is always a jammer randomly attacking $k_j$
transmitting channels constantly over time, we have the
following theorem for the AOEECC-EXP3++$^{\text{AVG}}$ algorithm,
which gives a much more refined regret bound than
the general regret bound in the adversarial regime.

Theorem 9. Let c > 18 and rj n = Pn- Let n* be the minimal 
integer that satisfies n* > 4c K ? anc [ Let n*(/) = 

max{n*, [e 1 ^^ 2 ]} and n* = max{f eK yn*(f). The regret 
of the AOEECC-EXP3++ algorithm with £„(/) = w | (ln ^ )2 

and 5 n = 2k \J K1 “ K , termed as AOEECC-EXP3++ / ' VG under 
oblivious jamming attack, in the mixed stochastic and adver¬ 
sarial regime satisfies: 

Regret n = R(n) plP <0(K(k - kj) ^f- ) 

+0(Akj VtKlnK) (17) 
Violation™ = EEj l =1 pJP t - P 0 ]+ < 0{ni). 

Note that the results in Theorem 9 have better regret performance
than the classic results obtained in the adversarial regime
shown in Theorem 1 and the anti-jamming algorithm in [25].

Theorem 10. The expected EE of e-SPA scheme of 
AOEECC-EXP3++ avg , in the mixed stochastic and adversarial 
regime under oblivious jamming attack is at least 

r 2cK(k-kj) ln«l + cxe)tsp/T) 3 

Gn(e)t a v maX _ Aj(l + ae)t sp /T _ 

T — J- + 1 

(l + cxe)t S pK In K 

^+1 ’ 

cte 1 

where T = nt sp + ent a . With sufficiently large T, our e-SPA 
scheme of AOEECC-EXP3++ is almost optimal. 

Theorem 11. Let c > 18 and rj n > /3 n . Let n* be the 
minimal integer that satisfies n* > 4c K , and Let 

n*(f ) = max{n*, |" e 1 / A (/) 2 "|} and n* = max{f eK yn*(f). 

The regret of the AOEECC-EXP3++ algorithm with £ n (/) = 

an ^ = 2 l/lM termed as AOEECC- 

nA n _i(/) 2 n V n 

EXP3++ avg 6-memory-bounded adaptive jamming attack, in 
the mixed stochastic and adversarial regime satisfies: 

Regret™ = R(n) p t p < 0(K(k - kj) cln £ )3 ) 

+O{{0 + l)(4fcjVATln K)ini) (18) 
Violation™ = EE" =i pJPt - P 0 ]+ < 0{rA). 
Theorem 12. The expected EE of e-SPA scheme of 
AOEECC-EXP3++ avg , in the mixed stochastic and adversarial 
regime under 0-memory-bounded adaptive jamming attack is 
at least 

r 2cK(k-k j ) In qi + cxe)t sp /T) 3 

Gn{e)t a \ max Af (l + cxe)t S p/T 

T - -A-4-1 

ae 1 ^ 

(6»+l)(4/c i VK In KT) 2/3 ((l+ae)t sp ) 3 

where T = nt sp + ent a . With sufficiently large T, our e-SPA 
scheme of AOEECC-EXP3++ is almost optimal. 


D. Contaminated Stochastic Regime 

We show that the algorithm AOEECC-EXP3++$^{\text{AVG}}$ can
still retain "polylogarithmic-n" regret in the contaminated
stochastic regime with a potentially large leading constant in
the performance. The following is the result for the moderately
contaminated stochastic regime.

Theorem 13. Under the setting of all parameters given in 
Theorem 2, for n*(/) = max{n*, [e 4 / A ^ 2 ]}, where n* 
is defined as before and n 3 = max{f eK yn*(f ), and the 
attacking strength parameter ( G [0,1/4) the regret of the 
AOEECC-EXP3++ algorithm in the contaminated stochastic 
regime that is contaminated after r steps satisfies: 

Regret n = ii(n) p x P < O ffice) a’ ) + Kn\ 

Violation = EE” =1 pjP* - P Q ]+ < 0(ni). 

Note that $\zeta$ can be within the interval $[0, 1/2)$. If $\zeta \in
(1/4, 1/2)$, the regime is severely contaminated and the leading
factor $1/(1 - 2\zeta)$ will be very large. In that case, the obtained regret
bound is not very meaningful and could be much worse
than the regret performance in the adversarial regime for both
oblivious and adaptive adversaries.

Theorem 14. The expected EE of e-SPA scheme of 
AOEECC-EXP3++ in the contaminated stochastic regime that 
is contaminated is at least 

- r _ 2ckK In ((l-\-a£)t sp /T) 3 

G n {£)t a ^ (l-2)A / (l+ae)t ap /T 

T ~ ^ + 1 

as 

where T = nt sp + ent a . With sufficiently large T, our e-SPA 
scheme of AOEECC-EXP3++ is almost optimal. 


E. Further Discussions on the SPA Scheme 


Besides the study of performance bounds in different 
regimes, we further discuss the following important issues on 
the sensing and probing phases. 

1) Impact of Sensing Time: Usually, the false alarm probability
of sensing affects the performance of the SPA scheme, which
we have not analyzed so far. Since we consider an energy detector for
channel sensing, the false alarm probability is calculated by
El Pfa(t s ) = Q((i | - 1 )Vtsfs), where || is the decision 
threshold for sensing, f s is the channel bandwidth, and QQ is 
the Q -function for the tail probability of the standard normal 
distribution. Consider the false alarm probability, we have 
Corollary 15. The expected EE of e-SPA scheme of the 
AOEECC-EXP3++ Algorithm is at least 


G n ( e )^a 


G n 


-4 k 


> 


(1 +aeo)t sp K In K 
(1 ~Pfa)T 




The proof is similar to that of Theorem 2. For each round,
each channel is sensed and probed successfully with probability
$1 - P_{fa}$ in expectation. Replacing $T$ with $(1 - P_{fa})T$, we get
the above result. Here $P_{fa}$ is a function of $t_s$ and $\alpha = t_a / t_{sp}$.
Treating $t_s$ as a variable, we can compute the optimal $t_s$ that
maximizes the expected throughput by numerical analysis. By a
similar argument, we can obtain counterpart corollaries
for Theorem 4, Theorem 6, Theorem 8, Theorem 10,
Theorem 12 and Theorem 14, which we omit here for brevity.
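
A numerical treatment of the sensing time starts from evaluating the false alarm probability. The sketch below assumes the energy-detector form $P_{fa}(t_s) = Q((\text{threshold} - 1)\sqrt{t_s f_s})$ with a normalized decision threshold, where $Q$ is the standard normal tail probability; the argument names and numeric values are hypothetical.

```python
import math


def q_function(x):
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def false_alarm_probability(threshold_ratio, t_s, f_s):
    """P_fa for an energy detector with normalized threshold, sensing time t_s, bandwidth f_s."""
    return q_function((threshold_ratio - 1.0) * math.sqrt(t_s * f_s))


# Hypothetical values: a longer sensing time lowers the false alarm probability.
p_short = false_alarm_probability(threshold_ratio=1.05, t_s=1e-3, f_s=1e6)
p_long = false_alarm_probability(threshold_ratio=1.05, t_s=4e-3, f_s=1e6)
```

Since a longer $t_s$ lowers $P_{fa}$ but also consumes time that could be used for access, a simple one-dimensional search over $t_s$ against the bound in Corollary 15 is one way to carry out the numerical optimization mentioned above.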

Fig. 2: Cooperative Bandit Learning among Multiple SUs

2) Impact of Probing Time and Others: As pointed out, the
step of probing is not strictly necessary in our problem. The reason
why we need it is that we want to make sure the EE is high
enough, which is important when the channel qualities are very
bad. On the contrary, when channel qualities are good enough,
a scheme without probing may achieve better EE since there is
no probing overhead. Our $\epsilon$-SPA scheme can also be extended
to a simplified $\epsilon$-SA scheme without probing steps, in which
we only observe the transmission rate after each
successful transmission to calculate the EE. Thus, in $n$ rounds,
$\epsilon n$ expected data rates will be observed by the $\epsilon$-SA scheme.
Hence, we can show that the expected throughput of e-SA 

G u e 0 )t s K I n K 

scheme is - v , £ — 


where a' = 7 ^. Let t* be 

t s P 


the probing time which satisfies 


xr+k 


. / (l + oc' e 0 )t s K\nK 

V eT 


/ (l + ae 0 )(t s +t;)innX 

max v ^ +1 eT -. When t p < t*, we will use e- 

SPA, otherwise, we use e-SA. 

In addition, the knowledge of pi can be used to optimize t a 
to maximum the expected EE, if SUs possess this statistical 
information. 


VI. Cooperative Learning among Multiple SUs 

The previous sections focus on a single SU's
perspective, where the proposed AOEECC-EXP3++ is an
uncoordinated algorithm without cooperation with other SUs.
It is well known that exploiting the cooperative behaviors 
among multiple SUs, such as cooperative spectrum sensing 
El and spectrum sharing 0 , are effective approaches to 
improve the communication performance and EE of SUs. This 
section focuses on the accelerated learning by cooperative 
learning among multiple SUs. As we have noticed, considering 
information sharing among multi-users in MAB setting is 
recently a novel research direction, and we have seen initial 
results for stochastic MAB in m. Thus, our work can be 
regarded as the first one for both adversarial and stochastic 
MABs. 

The cooperative learning can use Common Control Chan- 
nels (CCCs) 0 to sharing information as illustrated in Fig. 2. 
Intuitively, when multiple SUs sense/probe multiple strategies
simultaneously and exchange this information among
themselves, more information is available for decision making,
which results in faster learning and a smaller regret value. At
each timeslot $n$, suppose there are $L_n$ SUs that cooperatively
perform the $\epsilon$-SPA and wish to explore a total of $L_n$ strategies
out of the total of $N$ ($1 \le L_n \le N$); they pick a subset
$O_n \subseteq \{1, \ldots, N\}$ of $L_n$ strategies to probe and observe the
channel losses and power strategies to get the EE. Note that the
channel losses that belong to the un-sensed and un-probed set
of strategies $\mathcal{V} \setminus O_n$ are still unrevealed. Accordingly, we have
the probed and observed set of channels $\tilde{O}_n$ with the simple
property $f \in \tilde{O}_n, \forall f \in i \in O_n$. The proposed Algorithm 2,
based on Algorithm 1, considers full information sharing
among the $L_n$ SUs for a single SU $u$. The probability
$\varrho_{n,u} = \varrho_n = (\varrho_n(1), \ldots, \varrho_n(N))$ of each observed strategy is

$$\varrho_n(i) = p_n(i) + (1 - p_n(i))\frac{L_n - 1}{N - 1}, \quad \text{if } i \in O_n, \quad (20)$$

where a mixture with the new exploration probability $(L_n - 1)/(N - 1)$
is introduced and $p_n(i)$ is defined in (11). Similarly,

777 — 1 ~ 

0n(f) = Pn(f) + (1 - Pn(f)) ^ , if f € O n . (21) 

Here, we have a channel-level new mixing exploration prob¬ 
ability (m n — l)/(n — 1) and p n (f) is defined in (fl2l ) . 
The probing rate m n denotes the number of simultaneous 
sensed/probed/accessed non-overlapping channels among all 
SUs at timeslot n. Assume the weights of channels measured 
by different probes of different SUs within the same time slot 
also satisfy the assumption in Section II-A. The design of (l20l ) 
and (f2TT) is well thought out and the proof of all results in this 
section are non-trivial tasks in our unified framework. 
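
The strategy-level observation probability in (20) can be sketched as follows, assuming the form $p_n(i) + (1 - p_n(i))(L_n - 1)/(N - 1)$: a strategy is observed either because it was drawn by the single-user rule or because it is among the $L_n - 1$ additional strategies probed by the cooperating SUs. The function and argument names are hypothetical.

```python
def cooperative_probability(p, num_probed, num_total):
    """Probability that a strategy is observed when num_probed of num_total are probed."""
    return p + (1.0 - p) * (num_probed - 1) / (num_total - 1)


# Hypothetical example with N = 15 strategies.
alone = cooperative_probability(0.1, num_probed=1, num_total=15)
everyone = cooperative_probability(0.1, num_probed=15, num_total=15)
some = cooperative_probability(0.1, num_probed=5, num_total=15)
```

With a single SU the expression reduces to the non-cooperative probability, and when all strategies are probed every strategy is observed with certainty; in between, the larger observation probability lowers the variance of the importance-weighted loss estimates.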


Algorithm 2 Cooperative AOEECC-EXP3++: Multi-user 
Multi-channel 6-SPA scheme_ 

Input: such that M n e V. Set 

Pn,£n(f) ,€n(f) at in Alg. 1. Vi e V,L 0 (i) = 0 
and V/e£,lo(/) = 0. 
for timeslot n = 1 , 2, ... do 

1: Choose one SPA strategy H n according to p n (fill) . 
Get advice 7r^ n as the selected strategy. Sample M n — 1 
additional strategies uniformly over N. Denote the set of 
sampled strategies by O n , where H n e O n and \O n \ = 
M n - Let ijj = 1 {heo n }- 

2: Update the probabilities g n (i) according to (l2Qb . The 
loss of the observed strategy is 

7 ^n{i) + A n -lP n -l(i) h w' r~ 

WnW = - 777 -C O n . (22) 

QnW 

3: Compute the probability of choosing each channel 
p n {f ) that belongs to i according to (fl2b . 

4: let l(/) n = ^-(f)feheo n - Update the channel proba¬ 
bilities Q n (f) according to (I2T1) . The loss of the observed 
channel is 

Mf) = W. t (f) n yf e 6 n ( 23) 

Qn \J ) 

5: Updates all weights w n (/) ,u) n (i), W n , Lagrangian 
multiplier and channel access probability e as in Algo¬ 
rithm 1. 

end for 


A. The Performance Results of e-SPA with e = 1 

If $m_n$ is a constant or lower bounded by $m$, we have
the following results. Define $G_n(\epsilon)$ as the expected average
EE that can be achieved by the $\epsilon$-SPA scheme over $n$
rounds. Theorem 16, Theorem 17, Theorem 18, Theorem
19, Theorem 20 and Theorem 21 bound the regret of EE,
$\max_{i \in S_r} \mathbb{E}_{i \sim p_n}\{G_{n,i}\} - \mathbb{E}[G_n(1)]$, when $\epsilon = 1$. From
these results, we see that the learning performance is accelerated
by a factor of $m$.

Theorem 16. Under the oblivious attack with same setting 
of Theorem 1, the regret of the AOEECC-EXP3++ algorithm 
in the cooperative learning with probing rate m satisfies 

Regret n = i2(n) p x P < 0( 4^/nf lnAT) (24) 

Violation,,, = EE" =1 pJPt - Po]+ < 0(ni). 

Theorem 17. Under the 6-memory-bounded adaptive attack 
with same setting of Theorem 3, the regret of the AOEECC- 
EXP3++ algorithm in the cooperative learning with probing 
rate m satisfies 

Regret n = R{n) p t p < O((0 + l)(4kfjfhiK)i n§). 

Violation n = EEILiPtl 5 * - P 0 ]+ < 0{n%). 

Considering the practical implementation in the stochastic 
regime by estimating the gap as (IV-B1I) . then we have 

Theorem 18. With all other parameters hold as in Theorem 
4, the regret of the AOEECC -EXP3 ++ algorithm with £ n (/) = 

and S n = 2— J KlnK in the cooperative learning 


c(ln ny 


mtA n _i(/) 2 

with probing rate m, in the stochastic regime satisfies 
Regret n = R(n) p r P < 0( ^ ) 

Violation,, = E[^" =1 pJP t - P Q ]+ < 0(ni). 

Theorem 19. With all other parameters hold as in Theorem 
9, the regret of the AOEECC-EXP3++ algorithm with £ n (/) = 


c(ln n) 


and S n = 2^- 


KlnK 


under oblivious jamming 


mtA n _ 1 (f) 2 

attack in the cooperative learning with probing rate m, in the 
mixed stochastic and adversarial regime satisfies 

Regret n = i?(n) p , P < 0( K ^ )c ^f ) 

+0(^VtKlnK) ( 26 ) 
Violation™ = E[£)" =1 pJP t - P 0 ]+ < 0(ni). 
Theorem 20. With all other parameters hold as in Theo¬ 
rem 11, the regret of the AOEECC-EXP3++ algorithm with 

&»(/) = yKzfjf and 5n = the cooperative 

learning with probing rate m, in the mixed stochastic and 
adversarial regime satisfies 


Regret n = R(n) plP < 0( 


K(k—kj ) cln(n) ,: 


) 


+0{{9 + \)%(AkVK\a.K)ini) ( 27 ) 
Violation™ = E[^" =1 pjP t ~ P 0 ]+ < 0(ni). 

Theorem 21. With all other parameters hold as in Theorem 
13, the regret of the AOEECC-EXP3++ algorithm in the 
cooperative learning with probing rate m in the contaminated 
stochastic regime satisfies 

Regret™ = R(n) plP < ) + Kn% (2g) 

Violation™ = E[^" =1 pJP t - P Q ] + < 0{ni). 


B. e-SPA and Cooperative Sensing/Probing Issues 

Due to space limitations, we do not present the EE lower-bound
guarantees in the cooperative learning scenarios
for $\epsilon$-SPA with general $\epsilon$. They are obtained simply by dividing
the regret-bound parts in the related Theorem
2, Theorem 4, Theorem 6, Theorem 8, Theorem 10, Theorem
12 and Theorem 14 by a factor of $m$.

In addition, cooperative sensing needs to be adopted,
where the cooperative sensing gain will improve the
sensing/probing performance. The formula considering the energy
detector with cooperative sensing can be found in many exist¬ 
ing works, such as in ED- Hence, the analysis of cooperative 
e-SPA-scheme of the AOEECC-EXP3++ on the issues of 1) 
Impact of sensing time and 2) Impact of probing time and 
others can follow the same line as in Section V.E. 

C. Distributed Protocols with Multiple Users 

While the cooperative learning scheme offers the optimal
learning performance, the design of a decentralized protocol
without using a CCC is a challenging issue [34]. Because the
energy detector presented in Sections III-V cannot differentiate
the spectrum usage of PUs and SUs, each SU's estimate of channel
quality is also affected by other SUs. Notice
that the $\epsilon$-SPA scheme of the AOEECC-EXP3++ developed
previously is applicable to all SUs, and their observations
of the best multi-channel access strategies are the same if
each of these channels has the same mean across the players.

For fully distributed solutions, the approach already proposed
in Sections III-V lets each SU run
the $\epsilon$-SPA scheme based on its own observations. However,
this leads to the situation in which all the users access
the same set of channels, which results in low efficiency. This
could be resolved by applying an approach similar to the TDFS
scheme [36], introducing round-robin schemes among SUs.
We leave the details on this point to future work.

VII. Proofs of Regrets in Different Regimes 

We prove the theorems of the performance results from the 
previous section in the order they were presented. 

Lemma 1. (Dual Inequality) Let f t ( A) = 4fA 2 + 
A(pJ_ 1 Pt_i-P 0 ), A t = [(At—i - ?7t-i A /7t-iV/t_i(At_i))] + 
and Ai = 0. Assuming r] n > 0, we have 

n X n 1 

E< A <- A >(R» ^ p?p>)+f EW- a2 ) < -t= a2 

t= 1 t=l hs/ln 

+ Xv / ^ + X?(PlP*) 2 (29) 

t= 1 t= 1 

Proof: First, we note that 

At = [(A t —i - ^t-iV7t-iV/t-i(A t -i))] + 

= [(1 — St—lPt— l\/7t-l)At— 1 1 yjlt —1 ( P o — Pt—1 Pt— 1)] + 

< [(1 - ^t-i^t-i v / 7t-i)At-i+^t-i v / 7t-i p o] + 

By induction on At, we can obtain At < Applying the 
standard analysis of online gradient descent 141] yields 

I At A| 2 

= |n + [(At_i—77t-iV7t-i(£t-iAt-i + P 0 —pJ_iPt-i))] — A| 2 

< |At—i — A| 2 + 7 7 t_ 1 2 7t-i|V/t_i(At_i)| 2 
—2(At_i — A)(?7t-i^/7t-iV/t-i(At-i)) 

< |At—i — A| 2 + ?7t-i 2 7t-i|^/t-i(At-i)| 2 
+277t-iA/7t-i(/t-i(A) - ft- i(At-i)) 

Then, rearrange terms, we get, 

/t-i(At-i)-/t-i(A) < 2 ^ t _\^——(\X — Xt-i\ 2 — |A At—11 2 ) 

+ v ^iH- 1 [v/t _ i( At— 1)|2 

- 2 v ^7rm-i (l A — A *-i| 2 — l A ~ A *-i| 2 ) 

I V'Yt-l'nt-i / T D \2 , ^ 


Note that r] t varies with t. For the first term leading factor 
in the r.h.s of the above inequality, use the trick by letting 
hn = 7t as indicated in f20l (page 25), e.g. if r] t = ^=, we 
have Ylt =l f/t — fo f/t^ = a factor of 2 gap between 

r\ n and p t • Thus, substitute r] t by ^ n /2 and expanding the terms 
on l.h.s and taking the sum over t, we obtain the inequality. 


A. The Adversarial Regime 


The proof of Theorem 1 borrows some of the analysis of
EXP3 for the loss model in [20]. However, the introduction of
the new mixing exploration parameter and the presence of channel
dependency as a special type of combinatorial MAB problem
in the loss model make the proof a non-trivial task, and we
prove it for the first time.

Proof of Theorem 1. 

Proof: Note first that the following equalities can 


be easily verified: E^, 


4(i),E^ p l 


— ^n(In) _ 

~ p n (J n ) aA1U ^n-P npn (/ n ) 

Let <f> n (i) = £ n (i) + A n P n (i). The regret with respect to 
®n(i) , is 


>(<) = 

and Ej 


^n(^n),E^ £ n (i) — 


P n 

1 = N. 






lt=l t =1 

The key step here is to consider the expectation of 
the cumulative losses i n (i) in the sense of distribution 
i ~ p n . For all strategies, we have the distribution vector 


_ w n -i(i) 
W n -i 


and for all the 


q n = (<ln(l),-,qn(N)) with q n (i) = 

channels, we have vector q n j = (g n j(l), q n j(K)) with 

Qnj(f) = Ei:f w„3~ lW - Let £n W = E/ ei £«(/)• 

However, because of the mixing terms of p n , we 
need to introduce a few more 

(T2 fel £ n(f)i •“’Xfg •••’X 


tec 


g terms of p n , 
notations. Let u 

Sn(f), 0,..,0) 

m 

p^~ u 


fe\c\ 


be the distribution over all the strategies. Let q n = g jj) 
be the distribution induced by AOEECC-EXP3++ at the time 
n without mixing. Then we have: 

= (! “ E/ e t(/)) E i~q t $t(*) + £t(i)Ei~ u $t(i) 

= (! - E/ £ t(/))(^lnE^ qt exp(-? 7 t ($t(i) 

(30) 


Ej~q t ^t(i)))) 
(i-E f £«(/)) 


+Ej 


nt 


In E,-, 


; exp(-r?t$t(*))) 


In the second step, we use the inequalities Inx < x—1 and 
exp(—x) — 1 + x < x 2 /2, for all x > 0, to obtain: 

InEj^q t exp ~ %~q t $tC?))) 

= lnE ; 

< E- 


(a) 

< E;, 


-q t 2 

where we used fjjj 


-q t exp + ntEj^QtU) 
q t (exp- 1 + Vt$t{j)) 

(6} — - - - 2 , x2 2n 


(31) 


< v 2 E^ q Mi)' + A t 2 *,qP „(*) 


< 


£t (f) i n ^e above inequality 

(a) and the fact (x + y) 2 < 2x 2 + 2 y 2 in the above inequality 

( b ) . Moreover, take expectations over all random strategies of 
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losses ^t(i) 2 , we have 


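$$\begin{aligned}
\mathbb{E}_n\bigl[\mathbb{E}_{i\sim q_t}\tilde{\ell}_t(i)^2\bigr]
&= \mathbb{E}_n\Bigl[\sum_{i=1}^{N} q_t(i)\Bigl(\sum_{f\in i}\tilde{\ell}_t(f)\Bigr)^2\Bigr]
\le \mathbb{E}_n\Bigl[\sum_{i=1}^{N} q_t(i)\,k\sum_{f\in i}\tilde{\ell}_t(f)^2\Bigr] \\
&= k\,\mathbb{E}_n\Bigl[\sum_{f=1}^{K}\tilde{\ell}_t(f)^2\sum_{i\in S: f\in i} q_t(i)\Bigr]
= k\,\mathbb{E}_n\Bigl[\sum_{f=1}^{K}\tilde{\ell}_t(f)^2\,q_{t,f}(f)\Bigr] \\
&\le k\,\mathbb{E}\Bigl[\sum_{f'=1}^{K}\frac{q_{t,f}(f')}{p_n(f')}\Bigr]
= k\,\mathbb{E}\Bigl[\sum_{f'=1}^{K}\frac{q_{t,f}(f')}{(1 - \sum_f\varepsilon_n(f))\,q_{t,f}(f') + \sum_{f\in i}\varepsilon_n(f)\,|\{i\in C: f\in i\}|}\Bigr] \\
&\le 2kK, \qquad (32)
\end{aligned}$$

where the last inequality follows from the fact that $(1 - \sum_f\varepsilon_n(f)) \ge \frac{1}{2}$ by the definition of $\varepsilon_n(f)$. Similarly,

$$\mathbb{E}_n\bigl[\mathbb{E}_{i\sim q_t}\tilde{P}_n(i)^2\bigr] \le 2kK. \qquad (33)$$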


In the third step, note that $\tilde{L}_0(i) = 0$. Let $\Upsilon_n(\eta) = \frac{1}{\eta}\ln\frac{1}{N}\sum_{i=1}^{N}\exp(-\eta\tilde{L}_n(i))$ and $\Upsilon_0(\eta) = 0$. The second term in (30) can be bounded by using the same technique as in [16] (pages 26-28). Let us substitute inequality (32) into (31), then substitute (31) into equation (30), and sum over $n$. Using the fact that the sum of the expectations over $u$ and $q_t$ with respect to $\tilde{\Phi}_t(j)$ is less than $\mathbb{E}_{i\sim p_t}\tilde{\Phi}_t(j)$, and taking the expectation over all random strategies of losses up to time $n$, we obtain


Er, 


EEi~p t ^W 


L=i 

+E ? 


< 2kKJ^ Vt + 2kKJ2 Vt>H 


EE 

t=l 


t=l 

+ ^ 


In AT 

t=l Vn 

E( T t(^+l) -T t (r7t))) 

,t=i 


The last term on the r.h.s. of the inequality is less than or equal to zero, as indicated in (20). Then, we get


t=l 


<2 kKjfvt 


In N 


t= 1 


t= 1 


(a) " 

< 2 kK^Vt 
1=1 


k 


Tin 

In K 

Vn 


2 kK^Vt^t 

t= 1 
n 

^2kKj2vt\l (34) 


t= 1 


Note that inequality (a) is due to the fact that $N \le K^k$. Combining (29) and (34) gives


E 


t= 1 


E 


+ 

TC 


E A(p?Pt - Po) - ( 

t=l v 


6 n n 

2 


O 

^>/7t y 


A 2 


(a) n 

< E Vt 

t =1 


+ E (kKrin - 4f) E A? 


k]n 


t=l 

+E 


t=i 


E A t (pJP t - Po) 

m t=i 

For the above inequality (a), we again use the trick of letting $\eta_n = \eta_t$, as indicated in [16] (page 25), to extract $\eta_t$ from the sum of $\lambda_t^2$ over $n$, together with the inequality in (46). Let $kK\eta_n \le \frac{\delta_n}{4}$ by properly setting the values such that $\eta_n = O(\delta_n)$ (shown next). Thus, the last two terms on the r.h.s. of the above inequality are non-positive. Taking the maximization over $\lambda$, we have


n 


’[e (plPt-Po)] 2 " 

max E Ppt(i) 

p!Pt<Po t= 1 

+E 

Lt=i J + 

2(6n/2+l/ri) 


(a) n n 

< 2kK E Vt + k 1 ^ + E Vt y/Tf 


Note that our algorithm exhibits a bound with the structure

$$\mathrm{Regret}_n + \frac{\mathrm{Violation}_n^2}{O(n^{1-\alpha})} \le O(n^{1-\beta}). \qquad (36)$$

We can derive a regret bound and the violation of the long-term power budget constraint as

$$\mathrm{Regret}_n \le O(n^{1-\beta}),$$

$$\mathrm{Violation}_n \le \sqrt{O\bigl([n + n^{1-\beta}]\,n^{1-\alpha}\bigr)},$$

where the last bound follows from the naive fact $-\mathrm{Regret}_n \le O(n)$. In practice, this bound is coarse, and we can use the accumulated variance to obtain a better violation bound.
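
As a quick check of the exponent arithmetic behind the violation bound, the following worked step assumes the bound structure $\mathrm{Regret}_n + \mathrm{Violation}_n^2 / O(n^{1-\alpha}) \le O(n^{1-\beta})$ with $\alpha, \beta \in (0, 1)$. The value $\alpha = 1/2$ is only an illustrative choice and is not a setting stated at this point of the proof.

```latex
\begin{align*}
\mathrm{Violation}_n^2
  &\le O(n^{1-\alpha})\bigl(O(n^{1-\beta}) - \mathrm{Regret}_n\bigr)
   \le O\bigl([\,n + n^{1-\beta}\,]\,n^{1-\alpha}\bigr), \\
\mathrm{Violation}_n
  &\le \sqrt{O\bigl(n \cdot n^{1-\alpha}\bigr)}
   = O\bigl(n^{1-\alpha/2}\bigr),
  \qquad \text{e.g. } \alpha = \tfrac{1}{2} \;\Rightarrow\; O\bigl(n^{3/4}\bigr).
\end{align*}
```

The second line uses $n + n^{1-\beta} \le 2n$, so the coarse violation rate depends only on $\alpha$.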

Then, according to (36), we obtain


E 


E 


max E Pt’M*) -Pt’MJO 

pTp t <P 0 t=l 


< AkKmjn + k 


In K 
T]n 


EpJp* 


t= 1 


< 


J + 


J rT,mVYt<^nK\nK (37) 

t =l 

+ AkVnK In (5 n n + 2/rj n ). 


Let Tj n = p n = = 0(n 1/2 ) , 1 /ry = O (n 1 / 2 ). 

Because 7 n = 0 (^M) ? t he term EEi^x/E = 0{ln(n)). 
Thus, it can be omitted when compared to the first two terms 
in th e r.h.s of (l49l) . Moreover, in this setting, we set S n = 
2 kyj such that 2 kKr\ n < Then, h n n = O (n 1 / 2 ). 
This proves the theorem. ■

Proof of Theorem 3. 

Proof: To defend against the $\theta$-memory-bounded adaptive adversary, we adopt the idea of the mini-batch protocol proposed in [29]. We define a new algorithm by wrapping AOEECC-EXP3++ with a mini-batching loop [51]. We specify a batch size $\tau$ and name the new algorithm AOEECC-EXP3++$_\tau$. The idea is to group the overall timeslots $1, \ldots, n$ into consecutive and disjoint mini-batches of size $\tau$. Each mini-batch can be viewed as a single round (timeslot), and the average loss suffered during that mini-batch is fed to the original AOEECC-EXP3++. Note that our new algorithm does not need to know $m$, which only appears as a constant as shown in Theorem 2. So our new AOEECC-EXP3++$_\tau$ algorithm still runs in an adaptive way without any prior knowledge about the environment. If we set the batch size $\tau = (4k\sqrt{K\ln K})^{-2/3}n^{1/3}$ in Theorem 2 of [29], we can get the regret upper bound in our Theorem 2. ■
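
The mini-batching loop described in this proof can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `Exp3` is a plain EXP3 learner used as a hypothetical stand-in for AOEECC-EXP3++, and the names `run_minibatched`, `loss_fn`, `horizon` and `batch_size` are assumptions introduced for this example.

```python
import math
import random


class Exp3:
    """Plain EXP3 over n_arms actions; a hypothetical stand-in for the base learner."""

    def __init__(self, n_arms, eta, rng):
        self.n_arms = n_arms
        self.eta = eta
        self.rng = rng
        self.cum_est_loss = [0.0] * n_arms  # importance-weighted loss estimates

    def probabilities(self):
        # Subtract the minimum for numerical stability before exponentiating.
        base = min(self.cum_est_loss)
        weights = [math.exp(-self.eta * (l - base)) for l in self.cum_est_loss]
        total = sum(weights)
        return [w / total for w in weights]

    def act(self):
        self.last_probs = self.probabilities()
        self.last_arm = self.rng.choices(range(self.n_arms), weights=self.last_probs)[0]
        return self.last_arm

    def update(self, loss):
        # Unbiased estimate: observed loss divided by the probability of the played arm.
        self.cum_est_loss[self.last_arm] += loss / self.last_probs[self.last_arm]


def run_minibatched(base, loss_fn, horizon, batch_size):
    """Play one base action per mini-batch and feed back the batch-average loss."""
    actions = []
    t = 0
    while t < horizon:
        arm = base.act()  # the base learner sees each mini-batch as one round
        end = min(t + batch_size, horizon)
        batch_losses = []
        for s in range(t, end):
            actions.append(arm)
            # The loss may depend on the action history (adaptive adversary).
            batch_losses.append(loss_fn(s, actions))
        base.update(sum(batch_losses) / len(batch_losses))
        t = end
    return actions
```

For example, `run_minibatched(Exp3(3, 0.1, random.Random(0)), lambda s, a: 0.0 if a[-1] == 0 else 1.0, 25, 4)` returns 25 actions that are constant within each block of four timeslots; the last block is shorter when the horizon is not a multiple of the batch size.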


B. The Stochastic Regime 

Our proofs are based on the following form of Bernstein's inequality, with a minor improvement, as shown in [35].

Lemma 2. (Bernstein's inequality for martingales). Let $X_1, \ldots, X_m$ be a martingale difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_k)_{1\le k\le m}$ and let $Y_k = \sum_{j=1}^{k} X_j$ be the associated martingale. Assume that there exist positive numbers $\nu$ and $c$, such that $X_j \le c$ for all $j$ with probability 1 and $\sum_{k=1}^{m}\mathbb{E}\bigl[(X_k)^2 \mid \mathcal{F}_{k-1}\bigr] \le \nu$ with probability 1. Then for all $b > 0$:

$$\mathbb{P}\Bigl[Y_m > \sqrt{2\nu b} + \frac{cb}{3}\Bigr] \le e^{-b}.$$
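
To get a feel for the deviation level in Lemma 2, the sketch below compares the empirical tail frequency of a simple martingale with the bound $e^{-b}$. The setup is an assumption made only for illustration: the increments are i.i.d. uniform on $[-1, 1]$ (so $c = 1$ and $\nu = m/3$), and the function names are hypothetical.

```python
import math
import random


def bernstein_threshold(nu, c, b):
    """Deviation level sqrt(2*nu*b) + c*b/3 appearing in the lemma."""
    return math.sqrt(2.0 * nu * b) + c * b / 3.0


def empirical_tail(m, b, trials, seed=0):
    """Fraction of runs in which Y_m exceeds the Bernstein threshold.

    Illustrative choice: X_j i.i.d. uniform on [-1, 1], so c = 1 and the
    conditional variances sum to nu = m / 3.
    """
    rng = random.Random(seed)
    threshold = bernstein_threshold(m / 3.0, 1.0, b)
    exceed = 0
    for _ in range(trials):
        y = sum(rng.uniform(-1.0, 1.0) for _ in range(m))
        if y > threshold:
            exceed += 1
    return exceed / trials
```

Calling `empirical_tail(100, 2.0, 2000)` and comparing the result with `math.exp(-2.0)` shows how conservative the bound is for this particular distribution.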

Lemma 3. Let {e n (f)}^ =1 be non-increasing deterministic 
sequences, such that £ n (f) < £ n (f) with probability 1 and 
e n [f) < e n (f*) for all n and /. Define i/ n (f) = Y^=i kfJJ)' 
and define the event £/ 

nA(/)-(Z n (/*)-Z n (/)) 

< V^M/) + Vn(f*))bn + (1/ 3 fc fc g n (/* 5 ) )& " 

Then for any positive sequence $b_1, b_2, \ldots$, and any $n^* \ge 2$, the number of times channel $f$ is played by AOEECC-EXP3++ up to round $n$ is bounded as:

mn(f)} < (n* - 1) + E e- b * + k E 

^ *--* (38) 

+ E e- 




where 

M/) = nA(f) - 


\f 2 n6n + fc£„(/*)) 


3e n (/*) • 


t=l 


< E E * (Uf)-UnY 


n , 

= E(: 

£=1 V 
n , 

(a) n . 

S E (=* 

n / 


&(/) s 




( 4 (/*) 2 ]) 


(/) ^ Pt(/ 
(/) 


:/*)) 


(/) + 


*(/*)) 

XP)) = Mf) + Mf*) 


+V[A t = f\£U\P[£U] 

< EP[A t = /!£/_,]! / + Pier] 

t=l L t “ lJ 

n 

< P[A t = } + e 

t=i L t - 1J 


o—bt-i 


We further upper bound E[YL, = f\S{_i\t{ g f I as follows: 

p[A t = /if/_i]i {£/ _ i} 

= pt(/)i {£ /_ i} 

< (<7t(/) + fce t (/))l {£ /_ i} 

= e(/) + E -e!r i(i) )i^ 


(a) 


Wi-» 


<(*£*(/)+ e _,, ‘(*‘ (i) “*‘ (i * ) ))l 


( 6 ) 


{«/-!> 




= (&£*(/) 


( C ) . 




-v± h t (f)p-vt At (r t (/* )-r t (/)) 


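Proof: Note that the elements of the martingale difference sequence $\{\Delta(f) - (\tilde{\ell}_n(f) - \tilde{\ell}_n(f^*))\}_{n=1}^{\infty}$ are bounded by $\max\{\Delta(f) + \tilde{\ell}_n(f^*)\} = \frac{1}{k\underline{\varepsilon}_n(f^*)} + 1$. Since $\underline{\varepsilon}_n(f^*) \le \varepsilon_n(f^*) \le 1/(2K) \le 1/4$, we can simplify the upper bound by using $\frac{1}{k\underline{\varepsilon}_n(f^*)} + 1 \le \frac{\frac{1}{4} + \frac{1}{k}}{\underline{\varepsilon}_n(f^*)}$.

We further note that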

EE, [(A(/)-(4(/)-4(/*))) s 

t =1 L 


The above inequality (a) is due to the fact that channel $f$ only belongs to one selected strategy $i$ at $t$; inequality (b) holds because the cumulative regret of each strategy is greater than the cumulative regret of each channel that belongs to the strategy; and inequality (c) is due to the fact that $\varepsilon_t(f)$ is a non-increasing sequence, so that $\nu_t(f) \le \frac{t}{k\varepsilon_t(f)}$. Substituting this result back into the computation of $\mathbb{E}[N_n(f)]$ completes the proof. ■

Proof of Theorem 5. 

Proof: The proof is based on Lemma 1 and Lemma 3. Combining (29) and (38), we have


E 


EpfMO-p^tUt) 

t =1 


+ 


E 


n / \ 

EA(p?P t -P 0 )-(f + i) A 2 




< EE[^(/)]A(/)+E^ 

/=! t =1 

-f E A? +E 

t=l 


E A t (pjp t - P 0 ) 
£=1 


Obviously, the last two terms on the r.h.s. of the above inequality are negative. Taking the maximization over $\lambda$, we have


E 


with probability 1. The above inequality (a) is due to the fact that $q_n(f) \ge \sum_{f\in i}\varepsilon_n(f)\,|\{i\in C: f\in i\}|$. Since each $f$ only belongs to one of the covering strategies $i\in C$, $|\{i\in C: f\in i\}|$ equals 1 at time slot $n$ if channel $f$ is selected. Thus, $q_n(f) \ge \sum_{f\in i}\varepsilon_n(f) = k\varepsilon_n(f)$.

Let $\bar{\mathcal{E}}^f_n$ denote the complement of the event $\mathcal{E}^f_n$. Then by Bernstein's inequality $\mathbb{P}[\bar{\mathcal{E}}^f_n] \le e^{-b_n}$. The number of times the channel $f$ is selected up to round $n$ is bounded as:

mn(f)] = E V[A t = /] 

t =1 

= E V[A t = f\sU]P[sU] 

t= 1 


max Ep^f(!)-p^t(/ ( ) 
p Tp t <p 01 =1 


+E 


E (p)Pt-Po) 


2(<Sn/2+l/rj) 


(39) 


iV 


Thus, 7 „ = Ef=i £ n (/) = E/=1 


< E E[iv n (/)]A(/) + E vtVTt- 

/=1 t=l 

Set b n = ln(nA(f) 2 ), e n (f) = £„(/) and £„(/) = raA b ( " /)a • 

_ ln(nA(/) 2 ) _ ^/ln(n)\ 

nA(/) 2 - 

For any c > 18 and any n > n*, where n* is the minimal 
integer for which 77* > 4c - , we have 

3£n(/*) 


hn{f) = nA{f) - 


x(f) k£ n 

i " A (/) - 


(a) 


>nA(/)(l-^-^)>inA(/). 

The above inequality (a) is due to the fact that (1 — -^== — 

( i + 1 ) C 

y4 3c k is an increasing function with respect to k(k > 1). The 
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transmission power is quasi-concave to the reward such that 
the accumulated power allocation strategy have (— h t (f ) + 
A*(f t (f) ~ f t (/*))) < 0, and by substitution of the lower 
bound on h n (f ) into Lemma 3. Thus, we have 

A (/) ' 

t= 1\ 

< J z C ln ( n ) 


E[JV„(/)] < + £( e " 


+ 1“% + n* + 


m 


A(fY~ 1 A (/) 2 1 1 ^A (/) 2 

where we used Lemma 3 to bound the sum of the exponents 
in the first two terms. In addition, please note that 77* is of the 
order 0( A , ). The last term is bounded by Lemma 10 

in [35].

Inequality (40) bounds the first term on the r.h.s. of (39). For the
second term, since t] n =\ EtU Vts/ji = 0(y/ln(n)). 
That indicates 


E 


n 


"[e (plPt-Po)l 2 " 

max E -pftt(It) 

_pTp t < Po t =l 

ln (n) .— x IE \ 

+E 

Lt=i J _|_ 

( fc +v / s¥) 


(41) 


Moreover, set S = 2 k 


KlnK 


, we have kKr\ < | and 


Sn + \ ~ (2k + 1) \j2Kn = 0(y/n). Then, we obtain 


max E P ~ P ftt(It) 

p^Pt-Po t =l 


t =1 


-I + 


<0 (fc £M»>i) 


< W(n + k 


c In (n) 2 

~HIY 




(42) 


= 0 ( 714 ). 

This proves the theorem. ■

Proof of Theorem 7. 

Proof: The proof follows the same idea as that of Theorem 5, Lemma 1 and Lemma 3. Here, we only show the part that differs. Note that by our definition $\hat{\Delta}_n(f) \le 1$ and the sequence
s n (f) = £ n = minj^A,/3 n , cln ^ n ^ } satisfies the condition 
of Lemma 10. Note that when f3 n > cln ^ }, i.e., for n large 

enough such that n > — \ n ^K ) —» we have = — 7T~ m 
Let b n = / 77,(77,) and let 77,* be large enough, so that for all 
n > n* we have n > 4c K and n > e A G) 2 . With these 
parameters and conditions on hand, we are going to bound the 
rest of the three terms in the bound on E [N n (f)] in Lemma 


10. The upper bound of Ylt=n* 
bounding k Yn=n* e t(f)^s?f 
we have 


0 — bt 


vUv 


we 


is easy to obtain. For 
note that holds and 


A n(f) > ^(max(L n (fc)) - L n (f)) > ±(L n (/*) - L n (/)) 


> £M/) = l (nA(/) 

= £ ( nA (/) - 


(i 


2n 


(<*) , ( 

> - \nA(f) - 


; __ (j±lE 

•y/c/c ln(n) 3cln(n) 


3—n / 


2n 


1.25n 


(b) 

> 


y/c\n(n) 3c ln(n) 

A (/) (l- ts-W)>I a (A 

where the inequality (a) is due to the fact that 4(n A (/) - 
) is an increasing function with respect to 


2 n 


(Hi)" 


y'cfc ln(n) 3cln(n) 

k(k > 1) and the inequality (b) due to the fact that for n>n* 


we have ^J^n{n) > 1/A(/). Thus, 


£ K (f)t {£ f K _ i} < 


c(lnn) 2 

nA n (f) 2 


Ac 2 (Inn) 2 
nA(f) 2 


and kY^t= n * = 0 ( ^(/^ )■ Finall Y> for the 

last term in Lemma 10, we have already get h n (f) > \A(f) 
for 77, > 77,* as an intermediate step in the calculation of bound 
on A n (f). Therefore, the last term is bounded in a order of 
0( a(/) 2 )• Use all these results together we obtain the results 
of the theorem. Note that the results hold for any $\eta_n \ge \beta_n$.


C. Mixed Adversarial and Stochastic Regime 
Proof of Theorem 9. 

Proof: The proof of the regret performance in the mixed adversarial and stochastic regime is simply a combination of the performance of the AOEECC-EXP3++$^{\mathrm{AVG}}$ algorithm in the adversarial and stochastic regimes. It follows directly from Theorem 1 and Theorem 7. ■

Proof of Theorem 11. 

Proof: As above, the proof follows directly from Theorem 3 and Theorem 7. ■

D. Contaminated Stochastic Regime 
Proof of Theorem 13. 

Proof: The key idea of proving the regret bound under the moderately contaminated stochastic regime relies on how to estimate the performance loss by taking into account the contaminated pairs. The rest of the proof follows the same idea as that of Theorem 7, Lemma 1 and Lemma 3. Here, we only show the part that differs. Let $\mathbb{1}^*_{n,f}$ denote the indicator function of the occurrence of contamination at location $(n, f)$, i.e., $\mathbb{1}^*_{n,f}$ takes value 1 if contamination occurs and 0 otherwise.

Let $m_n(f) = \mathbb{1}^*_{n,f}\,\ell_n(f) + (1 - \mathbb{1}^*_{n,f})\,\mu(f)$. If base arm $f$ was contaminated on round $n$, then $m_n(f)$ is assigned a loss value that is arbitrarily affected by some adversary; otherwise we use the expected loss. Let $M_n(f) = \sum_{t=1}^{n} m_t(f)$; then $(M_n(f) - M_n(f^*)) - (\tilde{L}_n(f) - \tilde{L}_n(f^*))$ is a martingale. After $\tau$ steps, for $n \ge \tau$,

(M n (f) — M n (f*)) > ~ Uf*)) 

+n min{l - t* n f , 1 - 1 *,/»}(m(/) - M/*)) 

> -(nA(f) + (n- CnA(/))A(/) > (1 - 20nA(/). 
Define the event Z[: 

(l-2C)nA(/)— (X(/) - L n (f*)) < 2 

where $\varepsilon_n$ is defined in the proof of Theorem 2 and $\nu_n = \sum_{t=1}^{n}\frac{1}{k\varepsilon_t}$. Then by Bernstein's inequality $\mathbb{P}[\bar{\mathcal{Z}}^f_n] \le e^{-b_n}$. The remaining proof is identical to the proof of Theorem 2.

For the regret performance in the moderately contaminated stochastic regime, according to our definition with the attacking strength $\zeta \in [0, 1/4]$, we only need to replace $\Delta(f)$ by $\Delta(f)/2$ in Theorem 4. ■

VIII. Proof of Regret for Accelerated AOEECC 
Algorithm 

We prove the theorems of the performance results in Section 
VI in the order they were presented. 
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A. Accelerated Learning in Adversarial Regime 


The proof of Theorem 16 requires the following lemma, which is Lemma 7 of [49]. We restate it for completeness.

Lemma 4. For any probability distribution $u$ on $\{1, \ldots, K\}$ and any $m \in [K]$:


£ t (i) with respective to distribution u, we have 

N 


E„ 


E, 


'i~ipt 


M i 


= E„ 


E MMM) 


/=i iev-.fei 

K 
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Mf)M~ 1) 


u(f)(K -m) + m 
Proof of Theorem 16. 
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< k¥* n 


1 m 


Elf 1 -(/') 


En 

EftWtEW)) 

< E„ 

E M*)(EMf)) 


*=1 fez 


i =1 /ei 


E„ 

E Lf) E Mi) 

= E„ 
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- f hi Mn 


( 47 ) 
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^ Pt(/)+( 1 -pt(/)) z ^Gr 

(«) ^ , m (b) „ 


jV = 1 Pi (/) + (! —Pi (/)) ~K^T 

Proof: With similar facts and notations as in the proof of u u . v . , x . , . ^ . ~ , r\ 

■\ , _ r where the above inequality (a) is due to the fact that Pn\j) 

>rem 1, we have: Then we have: , . r . /n r u T 

/o. _ i t i onrl tna ohAiia inanno ini I n ta tna I ammo / 


Theorem 

Ei~ Pt $t(i) = (1 - E/ e t(/))E^ qt $ t (*) +e t (*)E^ u $ t (z) 

= (! - E/£t(/))(^ lnE *~q t ex P(-m(^t(*) 

_Ej^ q |> t (j)))) (43) °f l° sses U P 1° ti me we obtain 

_ (i T, f ^(f)) lnEj^ q exp(-77 t $t(i))) 


> 

ift-i(f) and the above inequality (6) follows the Lemma 4. 
In the third step, take expectation over all random strategies 


ih 

+E^ u 4 > t(i). 

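In the second step, we use the inequalities $\ln x \le x - 1$ and $\exp(-x) - 1 + x \le x^2/2$, for all $x \ge 0$, to obtain:

$$\begin{aligned}
\ln\mathbb{E}_{i\sim q_t}\exp\bigl(-\eta_t(\tilde{\Phi}_t(i) - \mathbb{E}_{j\sim q_t}\tilde{\Phi}_t(j))\bigr)
&= \ln\mathbb{E}_{i\sim q_t}\exp\bigl(-\eta_t\tilde{\Phi}_t(i)\bigr) + \eta_t\mathbb{E}_{j\sim q_t}\tilde{\Phi}_t(j) \\
&\le \mathbb{E}_{i\sim q_t}\bigl(\exp(-\eta_t\tilde{\Phi}_t(i)) - 1 + \eta_t\tilde{\Phi}_t(i)\bigr)
\end{aligned}$$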


E„ 


E E^p (i 


_t=1 

+E ? 




t=i 


7]n 

t= 1 t= 1 

~n —1 

+ E n E(Tt(^+i)-T t (^))) 

_t= 1 


The last term on the r.h.s. of the inequality is less than or equal to zero, as indicated in (20). Then, we get


(a) 

< E^q t - 


< + A|r/ t 2 E^ qt P n («) 


(44) 
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2 kK 
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t =1 
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In TV 2kK^ , 
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t=i 
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t=l 


Take expectations over all random strategies of losses £t{i) 2 , 
we have 

' N 


E n 
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Note that, the inequality (a) is due to the fact that N < K k . 
Combine (129b and ([34]) gives that 
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For the above inequality (a), we use the trick by letting 
r/n = rj t indicated in [16] (page 25) again to extract the 
rjt from the sum of A 2 over n and the inequality in (l46b . 


Let kKr] n < II] y l by setting properly the values such that 
where the above inequality (a) follows the fact that ^ = 0(( y (shown in the next) xhus? the last two terms 

— E> 2 ^ definition of £t{f) and the in the r.h.s of the above inequality is non-positive. By taking 

equality © and the above inequality ( b ) follows the Lemma max i m i zat i on 0 ver A, we have 
4. Similarly, 
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Then, we obtain 


E 
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max EPt^tW 
pJP t <P 0 t= 1 


< 4 kK 
— m 


nrj n + k 


In K 

Hri 


E - Pc 

t =1 


< 


n /- 

+ J2vt^yi<^kJn^\nK (50) 

t=i _ l _ 

+ 4 k\fnK In A'j (S n n + 2/rj n ). 


Let rj n = P n = = 0(n 1/2 ),l/r? = Ofn 1 / 2 ). 

Because 7 „ = 0(^4), the term EEi^t-v/Tt = 0(ln(n)). 
Thus, it can be omitted when compared to the first two 
terms in the r.h.s o f (l49l) . Moreover, in this setting, we set 
S„ = 2such that 2 kKr/ n < Then, 6„n = 

O (n 1 / 2 ). This completes the proof. ■ 


Proof of Theorem 17.


Proof: The proof of Theorem 17 for adaptive adversary 
uses the same idea as in the proof of Theorem 2. Here, if we 
set the batch r = (4lnn) - ^n 3 in Theorem 2 of j29), 
we can get the regret upper bound in our Theorem 17. ■ 


B. Cooperative Learning of AOEECC in Stochastic Regime 


To obtain a tight regret bound for cooperative learning of AOEECC-EXP3++, we need to study and estimate the number of times each channel is selected up to time $n$, i.e., $N_n(f)$. We summarize it in the following lemma.


Lemma 5. In the multipath probing case, let {s n (f)}^ =1 
be non-increasing deterministic sequences, such that e n (f) < 
e n (f) with probability 1 and £ n (f) < £ n (/*) for all n and /. 
Define i '„(/) = E?=i kUJ) > and define the event 

mtA(f) - (!„(/*) -L n (f)) 


< V2K(/) + Vn{f*))bn + ( S ™)' 

Then for any positive sequence $b_1, b_2, \ldots$, and any $n^* \ge 2$, the number of times channel $f$ is played by AOEECC-EXP3++ up to round $n$ is bounded as:

n n 

E[JV„(/)] < (n*-i)+Ej- bt +kE^ t (f)t {3 f n} 

n 

+ E /- 

t=n* 

where 

M/) = mtA(f) - /2raf& n 




,(/) 


ke. 


1 ^ _ (j±j)&n 

ff*)J 3 ie n (/*) 


Proof: Note that AOEECC-EXP3++ probes $L_n$ strategies rather than one strategy in each timeslot $n$. Let $\#\{\cdot\}$ denote the number of elements in the set $\{\cdot\}$. Hence,

$$\mathbb{E}[N_n(f)] = \mathbb{E}\bigl[\#\{1 \le t \le n : A_t = f, \mathcal{E}^f_t\} + \#\{1 \le t \le n : A_t = f, \bar{\mathcal{E}}^f_t\}\bigr],$$

where $A_t$ denotes the action of channel selection at timeslot $t$. By the following simple trick, we have

E[N n (f)]=E[#{l<t<n:A t = f,£f}}+ _ 

E[#{l <t<n:A t = f,St}]] 

< E[53 l{l<t<n:A t =/}IP[#{^n}]] + 


E[E l { l< t <„:A t =/}P[#{^}]] ^ 


n 

t =1 


n t 

E[E 1{1 <t<n:A t =f}^[^Lt\]- 


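Note that the elements of the martingale difference sequence $\{\Delta(f) - (\tilde{\ell}_n(f) - \tilde{\ell}_n(f^*))\}_{n=1}^{\infty}$ are bounded by $\max\{\Delta(f) + \tilde{\ell}_n(f^*)\} = \frac{1}{k\underline{\varepsilon}_n(f^*)} + 1$. Since $\underline{\varepsilon}_n(f^*) \le \varepsilon_n(f^*) \le 1/(2K) \le 1/4$, we can simplify the upper bound by using $\frac{1}{k\underline{\varepsilon}_n(f^*)} + 1 \le \frac{\frac{1}{4} + \frac{1}{k}}{\underline{\varepsilon}_n(f^*)}$.

We further note that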


Ej 


Hi 




( <E t <JmE (A(/)-&(/)-**(/*))) 

l t=i L 

<mEE t [(£(/)-£(/*)) 2 ' 

t=l L 

= m £ (E* [(H/H + £* fe(/*) 2 l) 

t= i \ I J L J/ 


1 } 


- m 'f ‘ 1 (&(/) + Stiff 

- TO t ?i (*^T77 + 

< ™ E (fci7(7) + fci^F)) = TOZ/ ™(/) + m Mf*) 


with probability 1. The above inequality (a) holds because each channel $f$ is probed at most $m$ times at timeslot $t$, and the same factor applies to the accumulated value of the variance $(\Delta(f) - (\tilde{\ell}_t(f) - \tilde{\ell}_t(f^*)))^2$. The above inequality (b) is due to the fact that $q_n(f) \ge p_n(f) \ge \sum_{f\in i}\varepsilon_n(f)\,|\{i\in C: f\in i\}|$. Since each $f$ only belongs to one of the covering strategies $i\in C$, $|\{i\in C: f\in i\}|$ equals 1 at time slot $n$ if channel $f$ is selected. Thus, $p_n(f) \ge \sum_{f\in i}\varepsilon_n(f) = k\varepsilon_n(f)$.

Let $\bar{\mathcal{E}}^f_n$ denote the complement of the event $\mathcal{E}^f_n$. Then by Bernstein's inequality $\mathbb{P}[\bar{\mathcal{E}}^f_n] \le e^{-b_n}$. According to (51), the number of times the channel $f$ is selected up to round $n$ is bounded as:

E[-/V n (/)] < £ F[A t = fl-UM-U] 

t= 1 _ _ 

+V[A t = f\ ZIYPI-U] 

< EP[^ = /|5f_ 1 ]l {H / r+PtSf-i] 

t= 1 t ~ lJ 

n , 

< znAt = mUt {s{ i} +/- 6 ‘-l 

t=i * _i 
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We further upper bound P [A t = y as follows: 

F[A t ~ p t (f)l {s{ _ i} 

< (' Qt(f ) + ke t {f))t {s f_ i} 

= C ks t (f ) 


Wt-i _ 

= (fc £f (/) + t:r f -^_ lW )w 


(a) 


E" i /' 




<(M/) + , 

= (&£*(/) 


(<0 . 




< fce t (/)l {H / } + e -^ fi A/) e -^A t (r t (r)-r t (/)) 

The above inequality (a) is due to the fact that channel $f$ only belongs to one selected strategy $i$ at $t$; inequality (b) holds because the cumulative regret of each strategy is greater than the cumulative regret of each channel that belongs to the strategy; and inequality (c) is due to the fact that $\varepsilon_t(f)$ is a non-increasing sequence, so that $\nu_t(f) \le \frac{t}{k\varepsilon_t(f)}$. Substituting this result back into the computation of $\mathbb{E}[N_n(f)]$ completes the proof. ■

Proof of Theorem 18. 

Proof: The proof is based on Lemma 5. Let h n = 
ln(nA(f) 2 ) and e n (f) = £ n (/). For any c > 18 and 
any n > n*, where n* is the minimal integer for which 
we have 


* > 4c 2 n In (n* A(/) 2 ) 2 
— m 2 A(/) 4 ln(n) 


M/) = mtA(f) - 
> mtA(f) — 2 


■\J2mtbn ( 


mtbn 


* mtA(f)(l - -fa - 


(a) 


k£n(f) 
2 

Ckc 


_ i _|_ 

ke n {f) ' ke. 

_ (i+i) b n 

3£n(/) 


.(/*)) 


3£n(/*) 


> mnA(/)(l - ^ > \mtA{f), 

where £„(/) = \ The transmission power is quasi- 

concave to the reward such that the cooperative learning strat- 
egy has ( -h t (f ) + A t (f t (/) - f t (/*))) < 0. By substitution 
of the lower bound on h n (f) into Lemma 5, we have 

!![»„(/)] < + + g (f -=^v /3S ^ 


< k 


cln (n) 


ln(n) 

mi U7 ACfj 7 ' r ^V m 2 a(/) 2 , 

where lemma 3 is used to bound the sum of the exponents. In 


0(- 

v ? 


r )+»*, 


kn 


addition, please note that n* is of the order 0 ( m 2 A (j) 4 ln ^ ;• 
The rest of the proof follows the same lines as the proof of Theorem 3. This completes the proof. ■

Proof of Theorem 19-Theorem 21. The proofs of Theorem 19-Theorem 21 use ideas similar to those in the previous proofs. We omit them here for brevity.


IX. Implementation Issues and Simulation Results 

A. Computationally Efficient Implementation of the AOEECC-EXP3++ Algorithm

The implementation of Algorithm 1 requires the computation of probability distributions and the storage of $N$ strategies, which has a time and space complexity of $O(K^k)$. As the number of channels increases, the strategy space becomes exponentially large, which scales poorly and results in low efficiency. To address this important problem, a computationally efficient enhanced algorithm is proposed by utilizing dynamic programming techniques. The key idea of the enhanced algorithm is to select the transmitting channels one by one until $k$ channels are chosen, instead of choosing a strategy from the large strategy space in each timeslot. Interested readers can find details in [30], [25]. Linear time and space complexity are achievable for AOEECC-EXP3++, which makes it highly efficient and easy to implement in practice.
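
The channel-by-channel selection idea can be sketched with a standard dynamic program over subset sums. This is only an illustration under stated assumptions, not the procedure of [30], [25]: it assumes every set of $k$ distinct channels is a feasible strategy, that a strategy's weight is the product of hypothetical per-channel weights, and it omits the exploration mixing. The function names are introduced for this example.

```python
import itertools
import math
import random


def suffix_subset_sums(weights, k):
    """table[j][r] = sum over r-subsets of channels j..K-1 of the product of weights."""
    num_channels = len(weights)
    table = [[0.0] * (k + 1) for _ in range(num_channels + 1)]
    for j in range(num_channels + 1):
        table[j][0] = 1.0  # the empty subset has product 1
    for j in range(num_channels - 1, -1, -1):
        for r in range(1, k + 1):
            # Either skip channel j, or take it and pick r-1 more from the rest.
            table[j][r] = table[j + 1][r] + weights[j] * table[j + 1][r - 1]
    return table


def sample_strategy(weights, k, rng):
    """Draw a k-subset with probability proportional to the product of its weights."""
    table = suffix_subset_sums(weights, k)
    chosen, remaining = [], k
    for j, w in enumerate(weights):
        if remaining == 0:
            break
        # Probability that channel j is in the subset, given the choices so far.
        take_prob = w * table[j + 1][remaining - 1] / table[j][remaining]
        if rng.random() < take_prob:
            chosen.append(j)
            remaining -= 1
    return chosen
```

The table has $(K+1)(k+1)$ entries, so one draw costs $O(Kk)$ operations instead of enumerating all strategies; for $K = 32$ and $k = 4$ the enumeration would cover `math.comb(32, 4) = 35960` strategies.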


B. Simulation Results 

We evaluate the performance of our AOEECC-EXP3++ algorithm on a cognitive radio system that contains 16 nodes and 8 USRP devices. There is a line-of-sight path between the two nodes of a path at a specific distance, which was varied across experiments from 10 meters to 60 meters with a fixed topology. We conduct all our experiments on our self-built system. The maximum transmission rate for each sensor node ranges from 20 bps to 240 kbps. We use the USRPs as CR nodes and the sensor nodes as the PUs. There are 32 channels available for the PUs in PC. The transmission bandwidth of the PUs is 4 MHz, while the bandwidth of each USRP with 4 SPA radios (channels) is 350 kHz. A single channel operates at 3.5 GHz with a receive noise figure of less than 8 dB; the maximum output power of each USRP device is 11.5 dBm, and the average transmission power is about 8.63 dBm. We only count the average measured circuit and processing power that is related to data transmission, which is about 46.7 dBm. We set the $P_o$ of each SU to be 9.24 dBm. We implement our SPA models and algorithms on top of the software suite built upon GNU Radio. We assume that all the SUs agree upon a common control channel (CCC), where channel 17 is used as the CCC. We take $\epsilon = 1$ to get the maximum achievable EE.
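
Since the power figures above are quoted in dBm while a power budget is naturally compared in linear units, a small conversion helper may be useful. The sketch below uses the standard definition of dBm (decibels relative to 1 mW); the function names are hypothetical.

```python
import math


def dbm_to_mw(p_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10.0 ** (p_dbm / 10.0)


def mw_to_dbm(p_mw):
    """Convert a power in milliwatts to dBm."""
    return 10.0 * math.log10(p_mw)
```

For instance, `dbm_to_mw(9.24)` is roughly 8.4 mW and `dbm_to_mw(11.5)` is roughly 14.1 mW.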

In Fig. 3, w.l.o.g., we normalize the EE to a unitary value in every timeslot $n$. Then, we have $M = 1$, $k = 4$ and $K = 32$. All computations on the collected datasets were conducted on an off-the-shelf desktop with dual 6-core Intel i7 CPUs clocked at 2.66 GHz. To show the advantages of our AOEECC-EXP3++ algorithms, we compare their performance to other existing MAB-based algorithms, which include: the EXP3-based combinatorial version (implemented by ourselves) of the ε-SPA for non-stochastic MABs in CC [22], which we name "ε-SPA-EXP3"; the combinatorial stochastic MAB algorithm, i.e., "CombUCB1", with the tight regret
bound as proved in ED, and the cooperative learning versions 
of algorithms of ours and others. In Fig. 3, the solid lines in the graphs represent the mean performance over the experiments and the dashed lines represent the mean plus one standard deviation (std) over the ten repetitions of the corresponding experiments. For a given optimal channel access strategy, small regret values indicate a large value of EE. We set all
versions of our AOEECC-EXP3++ algorithms parameterized 
by &(/) = fff ’ where A(/) is the empirical estimate 
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Fig. 3: Performance Comparison in Different Regimes. 


of $\Delta_t(f)$, and parameters $\eta_t$ and $\delta_t$ according to the theorems.
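
The solid and dashed curves described above (mean, and mean plus one standard deviation over repetitions) can be computed as in the following sketch. It assumes each repetition is available as a list of per-round regret values of equal length, and it uses the sample standard deviation; both the data layout and the function names are assumptions made for this example.

```python
import statistics


def cumulative(values):
    """Running sum of a per-round sequence."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def mean_and_upper_band(runs):
    """Per-round mean and mean-plus-one-std of the cumulative regret over runs.

    `runs` is a list of equally long per-round regret sequences, one per repetition.
    """
    curves = [cumulative(run) for run in runs]
    mean_curve, upper_curve = [], []
    for per_round in zip(*curves):
        mu = statistics.fmean(per_round)
        sd = statistics.stdev(per_round) if len(per_round) > 1 else 0.0
        mean_curve.append(mu)
        upper_curve.append(mu + sd)
    return mean_curve, upper_curve
```

With ten repetitions, `runs` would hold ten sequences, and the two returned lists correspond to the solid and dashed lines of one algorithm.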

In our first group of experiments in the stochastic regime (environment), shown in Fig. 1(a), AOEECC-EXP3++ clearly enjoys almost the same (cumulative) regret as CombUCB1 and has much lower regret over time than the adversarial ε-SPA-EXP3. We also see a significant regret reduction when accelerated learning ($m = 6, 16$) is employed for both AOEECC-EXP3++ and CombUCB1. For the subplot of the violation of the budget constraint, we also see very similar behaviors among all algorithms for a fixed setting of the CC topology.

In our second group of experiments in the moderately contaminated stochastic environment, there are several contaminated timeslots, as labeled in Fig. 1(b), which are caused by irregular jamming behaviors at some rounds. In this case, the contamination does not make the whole dataset fully adversarial; the contaminated rounds are instead drawn from a different stochastic model. Despite the corrupted rounds, the AOEECC-EXP3++ algorithm successfully returns to the stochastic operation mode, achieves better results than ε-SPA-EXP3, and performs comparably to CombUCB1. We also see that cooperative learning is highly efficient for all algorithms.

We conducted the third group of experiments in the adversarial regimes. We present the oblivious adversary case in Fig. 1(c). Due to the strong interference effect on each channel and the arbitrarily changing feature of the jamming behavior, all


[Plot for Fig. 4: EE loss versus distance d [m] with a path-loss exponent of 3. Curves: AOEECC-EXP3++ with (m, K) = (1, 8), (1, 16), (1, 32), (6, 8), (6, 16), (6, 32) and (16, 32).]

Fig. 4: EE loss of AOEECC-EXP3++ in a Path-loss Model 


algorithms experience very high accumulated regrets. It can be seen that our AOEECC-EXP3++ algorithm has close but slightly worse learning performance when compared to ε-SPA-EXP3, which confirms our theoretical analysis. Note that we do not implement stochastic MAB algorithms such as CombUCB1, since they are not applicable in this regime.

In our fourth set of experiments shown in Fig. 1(d), we simulate the adaptive jamming attack case in the adversarial regime with a typical large memory $\theta = 20$. We can see large
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and then provided solid theoretical analysis for each of them. 
We find that our formulated constrained regret minimization 
problem requires joint control of learning rate and exploration 
parameters to achieve best performance. We have also found 
and verified that cooperative learning is an effective approach 
to improve the performance of EECC. Extensive simula¬ 
tions were conducted to verify the learning performance. 
The proposed algorithm could be implemented efficiently in 
practical CC with different sizes. We believe that the idea 
and algorithms of this paper can be applied to other wireless 
communications problems in unknown environments. 
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Fig. 5: EE loss of AOEECC-EXP3++ in Rician Distribution 

performance degradations for all algorithms when compared to the oblivious jammer case. The multiplicative effect of $\theta$ makes it very hard for AOEECC-EXP3++ and ε-SPA-EXP3 to combat this type of jamming attack, although the regret curve is still sublinear after normalization.

We also compare the average EE loss of the proposed AOEECC-EXP3++ algorithm (after a run of $10^7$ rounds) with respect to the optimal solution for 100 random channel realizations with a path-loss exponent of 3, a noise figure of 7 dB, a carrier frequency of 3.5 GHz, a noise bandwidth of 10 MHz, and an average circuit power of 29.2 dBm for each transmitting channel $f$. The result is shown in Fig. 4. We
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